We propose a general method for constructing confidence intervals
and statistical tests for single or low-dimensional components of
a large parameter vector in a high-dimensional model. It can be easily 
adjusted for multiplicity taking dependence among tests into account. 
For linear models, our method is essentially the same as that of Zhang
and Zhang [37]: we analyze its asymptotic properties and establish
its asymptotic optimality in terms of semiparametric efficiency. Our 
method naturally extends to generalized linear models with convex 
loss functions. We develop the corresponding theory which includes a 
careful analysis for Gaussian, sub-Gaussian and bounded correlated 
designs. 



1. Introduction. Much progress has been made over the last decade 
in high-dimensional statistics where the number of unknown parameters 
greatly exceeds sample size. The vast majority of work has been pursued for 
point estimation such as consistency for prediction [14, 3], oracle inequalities 
and estimation of a high-dimensional parameter [7, 6, 36, 33, 23, 2, 25, 16] 
or variable selection [21, 38, 11, 34]. Other references and an exposition
for a broad class of models can be found in [12] or [5].

Very little work has been done on constructing confidence intervals, statistical
testing and assigning uncertainty in high-dimensional sparse models.
A major difficulty of the problem is the fact that sparse estimators such as 
the Lasso do not have a tractable limiting distribution: already in the low- 
dimensional setting, it depends on the unknown parameter [17] and hence 
the convergence to the limit is not uniform. Furthermore, bootstrap and even
subsampling techniques are plagued by non-continuity of limiting distributions.
While a modified bootstrap scheme has been proposed for the low-dimensional
setting [8], it is unclear whether such a method can be extended
to high-dimensional scenarios. 

Some approaches for quantifying uncertainty include the following. The work 
in [35] implicitly contains the idea of sample splitting and corresponding 
construction of p-values and confidence intervals, and the procedure has
been improved by using multiple sample splitting and aggregation of dependent
p-values from multiple sample splits [24]. Stability Selection [22] and
its modification [27] provide another route to estimate error measures for
false positive selections in general high-dimensional settings. From another
and mainly theoretical perspective, the work in [16] presents necessary and
sufficient conditions for recovery with the Lasso $\hat\beta$ in terms of
$\|\hat\beta - \beta^0\|_\infty$, where $\beta^0$ denotes the true parameter:
bounds on the latter, which hold with probability at least say $1 - \alpha$,
could be used in principle to construct (very)
conservative confidence regions. Other recent work is discussed in Section 
1.1 below. 

We propose here a method which enjoys optimality properties when making 
assumptions on the sparsity and design matrix of the model. For a linear 
model, the procedure is largely the same as the one in [37] and closely related 
to the method in [15]. It is based on the Lasso and is "inverting" the cor- 
responding KKT conditions. This yields a non-sparse estimator which has 
a Gaussian (limiting) distribution. We show, within a sparse linear model 
setting, that the estimator is optimal in the sense that it reaches the semi- 
parametric efficiency bound. Our procedure can be used and is analyzed for 
high-dimensional sparse linear and generalized linear models and for regres- 
sion problems with general convex (robust) loss functions. 

1.1. Related work. Our work is closest to [37] (and also [15], see below) 
who proposed the semiparametric approach for distributional inference in 
a high-dimensional linear model. We take here a slightly different view- 
point, namely by inverting the KKT conditions from the Lasso, while relaxed 
projections are used in [37]: we describe in Section 2.4 the exact relations. 
Furthermore, our paper extends the results in [37] by: (i) treating generalized 
linear models and general convex loss functions; (ii) for linear models, we give 
conditions under which the procedure achieves the semiparametric efficiency 
bound and our analysis allows for rather general Gaussian, sub-Gaussian and 
bounded design. An approach related to [37] was proposed in [4], based on
Ridge regression, which is clearly suboptimal and inefficient with a detection
rate (statistical power) larger than $n^{-1/2}$.

Very recently, and developed independently, the work in [15] provides a detailed
analysis for linear models (but not covering generalized linear models
as we do here) by considering a very similar procedure as in [37] and in
our paper. They show that the detection limit is indeed in the $1/\sqrt{n}$-range
and they provide a minimax test result; furthermore, they present extensive 
simulation results indicating that the Ridge-based method in [4] is overly 
conservative, which is in line with the theoretical results. Their optimality 
results are interesting and are complementary to the semiparametric optimality
established here. Our results cover a substantially broader range of
non-Gaussian designs in linear models, and we provide a rigorous analysis
for correlated designs with covariance matrix $\Sigma \neq I$: the SDL-test in [15]
assumes that $\Sigma$ is known while we carefully deal with the issue when
$\Sigma^{-1}$ has to be estimated (and argue why, e.g., the GLasso is not good
for our purpose).

Another way and method to achieve distributional inference for high-dimen- 
sional models is given in [1] (claiming semiparametric efficiency). They use a 
two-stage procedure with a so-called post-double-selection as first and least 
squares estimation as second stage: as such, their methodology is radically 
different from ours. 

2. High-dimensional linear models. Consider a high-dimensional linear
model

(1) $Y = X\beta^0 + \varepsilon$,

with $n \times p$ design matrix $X = [X_1, \ldots, X_p]$ ($n \times 1$ vectors $X_j$),
$\varepsilon \sim \mathcal{N}_n(0, \sigma_\varepsilon^2 I)$ and unknown regression $p \times 1$ vector $\beta^0$.
Throughout the paper, we assume that $p > n$. We denote by
$S_0 = \{j;\ \beta_j^0 \neq 0\}$ the active set of variables and
its cardinality by $s_0 = |S_0|$.

Our main goal is pointwise statistical inference for the components of the
parameter vector $\beta_j^0$ ($j = 1, \ldots, p$) but we also discuss simultaneous inference
for parameters $\beta_G^0 = \{\beta_j^0;\ j \in G\}$ where $G \subseteq \{1, \ldots, p\}$ is any group. To
exemplify, we might want to test statistical hypotheses of the form $H_{0,j}$:
$\beta_j^0 = 0$ or $H_{0,G}$: $\beta_j^0 = 0$ for all $j \in G$, and when pursuing many tests, we aim
for an efficient multiple testing adjustment taking dependence into account 
and being less conservative than say the Bonferroni-Holm procedure. 

2.1. The method: de-sparsifying the Lasso. The main idea is to invert the
Karush-Kuhn-Tucker characterization of the Lasso. We will discuss in Section
2.4 some alternative representations.

The Lasso [29] is defined as:

(2) $\hat\beta = \hat\beta(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2/(2n) + \lambda\|\beta\|_1 \right)$.

It is well known that the estimator in (2) must fulfill the Karush-Kuhn-Tucker
(KKT) conditions:

$-X^T(Y - X\hat\beta)/n + \lambda\hat\tau = 0$,

$\|\hat\tau\|_\infty \le 1$, and $\hat\tau_j = \mathrm{sign}(\hat\beta_j)$ if $\hat\beta_j \neq 0$.

The vector $\hat\tau$ arises from the sub-differential of $\|\beta\|_1$: using the first
equation, we can always represent it as

(3) $\lambda\hat\tau = X^T(Y - X\hat\beta)/n$.

The KKT conditions can be rewritten with the notation $\hat\Sigma = n^{-1}X^TX$:

$\hat\Sigma(\hat\beta - \beta^0) + \lambda\hat\tau = X^T\varepsilon/n$.

The idea is now to use a "relaxed form" of an inverse of $\hat\Sigma$. Suppose that
$\hat\Theta$ is a reasonable approximation for such an inverse, then:

(4) $\hat\beta - \beta^0 + \hat\Theta\lambda\hat\tau = \hat\Theta X^T\varepsilon/n - \Delta$, where $\Delta = (\hat\Theta\hat\Sigma - I)(\hat\beta - \beta^0)$.

We will show in Theorem 2.2 that $\Delta$ is asymptotically negligible under some
sparsity assumptions. This suggests the following estimator:

(5) $\hat b = \hat\beta + \hat\Theta\lambda\hat\tau = \hat\beta + \hat\Theta X^T(Y - X\hat\beta)/n$,

using (3) in the second equation. This is essentially the same estimator
as in [37], as discussed in Section 2.4, and it is of the same form as the
SDL-procedure in [15], when plugging in the estimate $\hat\Theta$ for the population
quantity $\Theta = \Sigma^{-1}$. With (4), we immediately obtain an asymptotic pivot
when $\sqrt{n}\Delta$ is negligible, as is justified in Theorem 2.2 below:

(6) $\sqrt{n}(\hat b - \beta^0) = W + o_P(1)$, $W | X \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\hat\Theta\hat\Sigma\hat\Theta^T)$.
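
The algebra behind (4) to (6) can be written out step by step. The following is only a sketch of what the text implies, using the symbols defined above; the identification of $W$ in the last line is our reading of (6).

```latex
\begin{align*}
\hat\Sigma(\hat\beta-\beta^0) + \lambda\hat\tau
  &= X^T\varepsilon/n
  && \text{(KKT with } Y = X\beta^0+\varepsilon\text{)}\\
\hat\Theta\hat\Sigma(\hat\beta-\beta^0) + \hat\Theta\lambda\hat\tau
  &= \hat\Theta X^T\varepsilon/n
  && \text{(multiply by } \hat\Theta\text{)}\\
\hat\beta-\beta^0 + \hat\Theta\lambda\hat\tau
  &= \hat\Theta X^T\varepsilon/n - (\hat\Theta\hat\Sigma-I)(\hat\beta-\beta^0)
  && \text{(this is (4))}\\
\sqrt{n}\,(\hat b-\beta^0)
  &= \underbrace{\hat\Theta X^T\varepsilon/\sqrt{n}}_{=:\,W} \;-\; \sqrt{n}\,\Delta
  && \text{(insert (5))}\\
\mathrm{Cov}(W \mid X)
  &= \hat\Theta\,\frac{X^T(\sigma_\varepsilon^2 I)X}{n}\,\hat\Theta^T
   = \sigma_\varepsilon^2\,\hat\Theta\hat\Sigma\hat\Theta^T .
\end{align*}
```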

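An asymptotic pointwise confidence interval for $\beta_j^0$, when conditioning on
X (or for fixed X), is then given by:

$[\hat b_j - c(\alpha, n, \sigma_\varepsilon),\ \hat b_j + c(\alpha, n, \sigma_\varepsilon)]$,

$c(\alpha, n, \sigma_\varepsilon) = \Phi^{-1}(1 - \alpha/2)\, n^{-1/2} \sigma_\varepsilon \sqrt{(\hat\Theta\hat\Sigma\hat\Theta^T)_{jj}}$.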
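
As an illustration, the estimator (5) and the interval above can be sketched in a few lines of Python. The function name `desparsified_ci` and the toy data are hypothetical; the sketch assumes NumPy is available and that $\hat\beta$, $\hat\Theta$ and $\sigma_\varepsilon$ are supplied by the user. In the toy run we take $p < n$ and use the exact inverse of $\hat\Sigma$ as a stand-in for $\hat\Theta$, which is not the high-dimensional setting of the paper.

```python
import numpy as np
from statistics import NormalDist

def desparsified_ci(X, y, beta_hat, Theta_hat, sigma_eps, alpha=0.05):
    """Estimator (5) together with the pointwise intervals [b_j - c, b_j + c]."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                                    # n^{-1} X^T X
    b = beta_hat + Theta_hat @ (X.T @ (y - X @ beta_hat)) / n  # estimator (5)
    Omega = Theta_hat @ Sigma_hat @ Theta_hat.T                # Theta Sigma Theta^T
    z = NormalDist().inv_cdf(1 - alpha / 2)                    # Phi^{-1}(1 - alpha/2)
    c = z * sigma_eps * np.sqrt(np.diag(Omega) / n)            # half-widths c(alpha, n, sigma)
    return b, b - c, b + c

# Toy run with p < n, where the exact inverse can stand in for Theta_hat.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.standard_normal(n)
Theta_hat = np.linalg.inv(X.T @ X / n)
b, lower, upper = desparsified_ci(X, y, np.zeros(p), Theta_hat, sigma_eps=1.0)
```

With an exact inverse, (5) reduces to ordinary least squares whatever initial estimator is plugged in, which gives a simple sanity check of the formula.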

If $\sigma_\varepsilon$ is unknown, we replace it by a consistent estimator as discussed in
Section 2.5.1.

2.1.1. The Lasso for nodewise regression. A prime example to construct
the approximate inverse $\hat\Theta$ is given by the Lasso for the nodewise regression
on the design X [21]: we use the Lasso p times for each regression problem
$X_j$ versus $X_{-j}$, where the latter is the design sub-matrix without the jth
column. For each $j = 1, \ldots, p$,

(7) $\hat\gamma_j = \arg\min_\gamma \left( \|X_j - X_{-j}\gamma\|_2^2/(2n) + \lambda_j\|\gamma\|_1 \right)$,

with components of $\hat\gamma_j = \{\hat\gamma_{j,k};\ k = 1, \ldots, p,\ k \neq j\}$. Denote by

$\hat C = \begin{pmatrix}
1 & -\hat\gamma_{1,2} & \cdots & -\hat\gamma_{1,p} \\
-\hat\gamma_{2,1} & 1 & \cdots & -\hat\gamma_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
-\hat\gamma_{p,1} & -\hat\gamma_{p,2} & \cdots & 1
\end{pmatrix}$

and by

$\hat T^2 = \mathrm{diag}(\hat\tau_1^2, \ldots, \hat\tau_p^2)$, $\hat\tau_j^2 = (X_j - X_{-j}\hat\gamma_j)^T X_j/n$.

Then, define

(8) $\hat\Theta_{Lasso} = \hat T^{-2}\hat C$.

Note that although $\hat\Sigma$ is self-adjoint, its relaxed inverse, $\hat\Theta_{Lasso}$, is not. In the
sequel, we denote by

(9) $\hat b_{Lasso}$ = the estimator in (5) with the nodewise Lasso from (8).

We consider the jth row of $\hat\Theta$, denoted by $\hat\Theta_j$ (as a $p \times 1$ vector), and
analogously for $\hat C_j$. Then, $\hat\Theta_{Lasso,j} = \hat C_j/\hat\tau_j^2$. Furthermore, because of the
choice of $\hat\tau_j^2$ we have

(10) $X_j^T X\hat\Theta_{Lasso,j}/n = 1$.

Moreover, by the KKT conditions for (7):

$\|X_{-j}^T X\hat\Theta_{Lasso,j}/n\|_\infty \le \lambda_j/\hat\tau_j^2$.

Hence we have

$\|\hat\Sigma\hat\Theta_j - e_j\|_\infty \le \lambda_j/\hat\tau_j^2$,

where $e_j$ is the j-th unit vector. We call this the extended KKT conditions.
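
The construction (7)-(8) and the two properties just stated can be checked numerically. Below is a minimal sketch, assuming NumPy and a plain cyclic coordinate descent solver written for this illustration; the function names `lasso_cd` and `nodewise_lasso`, the common value `lam` for all $\lambda_j$, and the simulated design are our assumptions, not part of the original text.

```python
import numpy as np

def lasso_cd(A, y, lam, n_sweeps=300, tol=1e-12):
    """Coordinate descent for argmin_g ||y - A g||_2^2 / (2n) + lam * ||g||_1."""
    n, q = A.shape
    g = np.zeros(q)
    r = y.astype(float).copy()                 # current residual y - A g
    col_sq = (A ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(q):
            old = g[k]
            z = A[:, k] @ r / n + col_sq[k] * old          # partial residual correlation
            g[k] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[k]
            if g[k] != old:
                r -= A[:, k] * (g[k] - old)
                max_change = max(max_change, abs(g[k] - old))
        if max_change < tol:
            break
    return g

def nodewise_lasso(X, lam):
    """Return (Theta_hat, tau2) as in (7)-(8), with lambda_j = lam for all j."""
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        gamma = lasso_cd(X[:, rest], X[:, j], lam)
        C[j, rest] = -gamma
        tau2[j] = (X[:, j] - X[:, rest] @ gamma) @ X[:, j] / n
    return C / tau2[:, None], tau2             # row j is C_j / tau_j^2

rng = np.random.default_rng(0)
n, p, lam = 25, 30, 0.3                        # p > n on purpose
X = rng.standard_normal((n, p))
Theta_hat, tau2 = nodewise_lasso(X, lam)
```

Property (10) holds by construction of $\hat\tau_j^2$, independently of how accurately (7) is solved, whereas the sup-norm bound relies on the solver having (approximately) reached the KKT point of (7).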

We note that using e.g. the GLasso estimator for $\hat\Theta$ seems not optimal because
(10) would fail, and it does not directly lead to the desirable componentwise
properties of the estimator $\hat b$ in (5) as established in Section 2.3.

2.2. Theoretical result for fixed design. We provide here a first result for
fixed design X. A crucial identifiability assumption on the design is the
so-called compatibility condition [30]. For a $p \times 1$ vector $\beta$ and a subset
$S \subseteq \{1, \ldots, p\}$, define $\beta_S$ by:

$\beta_{S,j} = \beta_j 1(j \in S)$, $j = 1, \ldots, p$.

Thus, $\beta_S$ has zeroes for the components outside the set S. The compatibility
condition requires a positive constant $\phi_0 > 0$ such that for all $\beta$ satisfying
$\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$:

$\|\beta_{S_0}\|_1^2 \le s_0\,\beta^T\hat\Sigma\beta/\phi_0^2$.

The value $\phi_0^2$ is called the compatibility constant. We make the following
assumption:

(A1) The compatibility condition holds (for $\hat\Sigma$) with compatibility constant
$\phi_0^2 > 0$. Furthermore, $\hat\Sigma_{jj} \le M^2 < \infty$ for some $0 < M < \infty$.

The assumption (A1) is briefly discussed in Section 2.3.2. We then obtain
the following result.

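Theorem 2.1. Consider the linear model in (1) with Gaussian error
$\varepsilon \sim \mathcal{N}_n(0, \sigma_\varepsilon^2 I)$, and assume (A1). When using the Lasso for nodewise regression
in (8) with $\lambda_j \equiv \lambda_{\max}$ $\forall j$ and the Lasso in (2) with
$\lambda \ge 2M\sigma_\varepsilon\sqrt{(t^2 + 2\log(p))/n}$:

$\hat b_{Lasso} - \beta^0 = \hat\Theta_{Lasso}X^T\varepsilon/n - \Delta$ (with $\Delta$ as in (4)),

$P[\|\Delta\|_\infty \le \|\hat T^{-2}\|_\infty \lambda_{\max}\, 4\lambda s_0/\phi_0^2] \ge 1 - 2\exp(-t^2/2)$,

where $\|A\|_\infty = \max_{j,k} |A_{j,k}|$ is the element-wise sup-norm for a matrix A. A
proof is given in Section 5.2.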

Theorem 2.1 gives a probabilistic bound for the error $\|\Delta\|_\infty$. Note that
since the design is fixed, $\hat\Theta_{Lasso}$ is fixed and non-random, and $\|\hat T^{-2}\|_\infty$ is
observed, and hence we compare $\|\Delta\|_\infty$ to a known constant. We will show
in the proof of Theorem 2.2, see Lemma 5.3 in Section 5, that $\|\hat T^{-2}\|_\infty$ is
asymptotically bounded and that the correct normalization factor for $\hat b_{Lasso}$
is $\sqrt{n}$. Thus, asymptotically, when choosing $\lambda_{\max} \asymp \lambda \asymp \sqrt{\log(p)/n}$, and if
$s_0\log(p)n^{-1/2} = o(1)$, the error term $\|\Delta\|_\infty = o_P(n^{-1/2})$ is negligible and
$\sqrt{n}(\hat b_{Lasso} - \beta^0) \approx \mathcal{N}_p(0, \sigma_\varepsilon^2\hat\Theta_{Lasso}\hat\Sigma\hat\Theta_{Lasso}^T)$. The details are discussed next.

2.3. Conditioning on random design and optimality. In order to make
theoretical statements that go beyond Theorem 2.1, we consider an asymptotic
framework with random design, but where the analysis for the statistical
inference is pursued conditioning on the design. The latter is the
common approach for low-dimensional linear models and implemented as
the standard procedure in software packages. Some results then follow for
random design as well.

The asymptotic framework uses a scheme where $p = p_n \ge n \to \infty$ in model
(1) and thus, $Y = Y_n$, $X = X_n$, $\beta^0 = \beta_n^0$ and $\sigma_\varepsilon^2 = \sigma_{\varepsilon,n}^2$ are all (potentially)
depending on n. In the sequel, we usually suppress the index n. We make
the following assumption.

(A2) The rows of X are i.i.d. realizations from a Gaussian distribution
$P_X$ whose p-dimensional covariance matrix $\Sigma$ has smallest eigenvalue
$\Lambda_{\min}^2 \ge L > 0$, and $\|\Sigma\|_\infty = \max_{j,k} |\Sigma_{jk}| = O(1)$.

The Gaussian assumption is relaxed in Section 2.3.4. We will assume below
some sparsity with respect to rows of $\Theta = \Sigma^{-1}$ and define:

$s_j = \sum_{k \neq j} I(\Theta_{jk} \neq 0)$.

We then have the following main result.

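Theorem 2.2. Consider the linear model (1) with Gaussian error
$\varepsilon \sim \mathcal{N}_n(0, \sigma_\varepsilon^2 I)$. Assume (A2) and the sparsity assumptions $s_0 = o(n^{1/2}/\log(p))$
and $s_j \le s_{\max} = o(n/\log(p))$ $\forall j$. Consider the choice of the regularization
parameters $\lambda \asymp \sqrt{\log(p)/n}$ for the Lasso in (2) and $\lambda_j \equiv \lambda_{\max} \asymp
\sqrt{\log(p)/n}$ $\forall j$ for the Lasso for nodewise regression in (8). Then:

$\sqrt{n}(\hat b_{Lasso} - \beta^0) = W_n + \Delta_n$,

$W_n | X \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\hat\Omega_n)$, $\hat\Omega_n = \hat\Theta\hat\Sigma\hat\Theta^T$,

$\|\Delta_n\|_\infty = o_P(1)$.

Furthermore, $\|\hat\Omega_n - \Sigma^{-1}\|_\infty = o_P(1)$ as $n \to \infty$.
A proof is given in Section 5.5.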

Theorem 2.2 has various implications. For one-dimensional components, we
obtain for all $x \in \mathbb{R}$:

(11) $P[\sqrt{n}(\hat b_{Lasso;j} - \beta_j^0)/(\sigma_\varepsilon\sqrt{\hat\Omega_{n;jj}}) \le x \,|\, X] - \Phi(x) = o_P(1)$ ($n \to \infty$),

where $\Phi(\cdot)$ denotes the c.d.f. of $\mathcal{N}(0, 1)$. Since the limiting distribution
is independent of X, the statement also holds unconditionally for random
design. Furthermore, for any group $G \subseteq \{1, \ldots, p\}$ which is potentially large,
we have that for all $x \in \mathbb{R}$:

$P[\max_{j \in G} |\sqrt{n}(\hat b_{Lasso;j} - \beta_j^0)| \le x \,|\, X] - P[\max_{j \in G} |W_{n;j}| \le x \,|\, X] = o_P(1)$.

Therefore, the distribution of $\max_{j \in G} \sqrt{n}|\hat b_{Lasso;j}|$ given X under the
null-hypothesis $H_{0,G}$: $\beta_j^0 = 0$ $\forall j \in G$ is asymptotically equal to that of the
maximum of dependent Gaussian variables $\max_{j \in G} |W_{n;j}|$ given X, whose distribution
can be easily simulated since $\hat\Omega_n$ is known, see also Section 2.5.

Remark 2.1. Theorem 2.2 can be extended to allow for non-Gaussian errors:
$\sqrt{n}(\hat b_{Lasso} - \beta^0) = W_n + \Delta_n$, with $\|\Delta_n\|_\infty = o_P(1)$, $\mathrm{Cov}(W_n | X) = \sigma_\varepsilon^2\hat\Omega_n$
but $W_n | X$ generally not exactly Gaussian. Often though, a central limit
theorem argument can be used to obtain approximate Gaussianity of
low-dimensional components of $W_n | X$, see also Section 3.2.

2.3.1. Uniform convergence. The statements of Theorem 2.2 also hold in a
uniform sense, and thus, the derived confidence intervals and tests in Section
2.5 below are honest [19]. We consider the set of parameters

$B(s) = \{\beta \in \mathbb{R}^p;\ \|\beta\|_0^0 \le s\}$.

We then have the following for $\hat b_{Lasso}$ in (9).

Corollary 2.1. Assume the conditions of Theorem 2.2 with $\beta^0 \in B(s_0)$
and $s_0 = o(n^{1/2}/\log(p))$. Then, when using $\lambda \asymp \sqrt{\log(p)/n}$ for the Lasso in
(2), and $\lambda_j \equiv \lambda_{\max} \asymp \sqrt{\log(p)/n}$ $\forall j$ for the Lasso for nodewise regression
in (8):

$\sqrt{n}(\hat b_{Lasso} - \beta^0) = W_n + \Delta_n$,

$W_n | X \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\hat\Omega_n)$, $\hat\Omega_n = \hat\Theta\hat\Sigma\hat\Theta^T$,

$\sup_{\beta^0 \in B(s_0)} \|\Delta_n\|_\infty = o_P(1)$.

Moreover, as in Theorem 2.2, $\|\hat\Omega_n - \Sigma^{-1}\|_\infty = o_P(1)$ ($n \to \infty$).

The proof is exactly the same as for Theorem 2.2 by simply noting that
$\sup_{\beta^0 \in B(s_0)} \|\hat\beta - \beta^0\|_1 = O_P(s_0\sqrt{\log(p)/n})$ (with high probability, the
compatibility constant is still bounded from below by L/2). □

Corollary 2.1 implies that for $j \in \{1, \ldots, p\}$:

$\sup_{\beta^0 \in B(s_0)} \left| P[\sqrt{n}(\hat b_{Lasso;j} - \beta_j^0)/(\sigma_\varepsilon\sqrt{\hat\Omega_{n;jj}}) \le x \,|\, X] - \Phi(x) \right| = o_P(1)$ ($n \to \infty$).

2.3.2. Discussion of the assumptions. The compatibility condition (A1) is
weaker than many others which have been proposed, such as assumptions on
restricted or sparse eigenvalues [31]. A slight relaxation by a constant factor
has recently been given in [28]; we could work with this slightly less established
condition without changing the asymptotic behavior of our results.
Assumption (A2) is rather weak as it concerns the population covariance
matrix.

Regarding the sparsity assumption for $s_0$ in Theorem 2.2, our technique
crucially uses the $\ell_1$-norm bound $\|\hat\beta - \beta^0\|_1 = O_P(s_0\sqrt{\log(p)/n})$, see Lemma
5.1. In order that this $\ell_1$-norm converges to zero, the sparsity constraint
$s_0 = o(\sqrt{n/\log(p)})$ is usually required. Our sparsity assumption is slightly
stricter by the factor $\log(p)^{-1/2}$ (because the normalization factor is $\sqrt{n}$),
namely $s_0 = o(\log(p)^{-1/2}\sqrt{n/\log(p)}) = o(n^{1/2}/\log(p))$.
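
To get a feeling for the two rates, one can tabulate the scales that appear inside the $o(\cdot)$ statements. The short sketch below is only arithmetic; the helper name `sparsity_scales` and the chosen values of (n, p) are illustrative assumptions, and the printed numbers are scales, not admissible thresholds for $s_0$.

```python
import math

def sparsity_scales(n, p):
    """Scales inside the o(.) conditions on s_0 discussed above."""
    l1_consistency = math.sqrt(n / math.log(p))   # s_0 = o(sqrt(n / log p))
    inference = math.sqrt(n) / math.log(p)        # s_0 = o(sqrt(n) / log p)
    return l1_consistency, inference

for n, p in [(100, 500), (1000, 5000), (10000, 50000)]:
    a, b = sparsity_scales(n, p)
    # The ratio a / b equals sqrt(log p), the extra factor mentioned in the text.
    print(n, p, round(a, 2), round(b, 2), round(a / b, 2))
```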

2.3.3. Optimality and semiparametric efficiency. Corollary 2.1 establishes,
in fact, that for any j, $\hat b_{Lasso;j}$ is an asymptotically efficient estimator of
$\beta_j^0$, in the sense that it is asymptotically normal, with asymptotic variance
converging, as $n \to \infty$, to the variance of the best estimator. Consider the
one-dimensional sub-model,

(12) $Y = \beta_j^0(X_j - X_{-j}\gamma_j^0) + X_{-j}(\beta_{-j}^0 + \beta_j^0\gamma_j^0) + \varepsilon$,

where $\gamma_j^0$ is the population analog of $\hat\gamma_j$, i.e., $X_j - X_{-j}\gamma_j^0$ is the projection of
$X_j$ to the subspace orthogonal to $X_{-j}$. Clearly, this is a linear submodel of
the general model (1), passing through the true point. The Gauss-Markov
theorem argues that the best variance of an unbiased estimator of $\beta_j^0$ in
(12) is given by $\sigma_\varepsilon^2/\mathrm{Var}(X_j - X_{-j}\gamma_j^0)$. Corollary 2.1, in turn, shows that
this is the asymptotic variance of $\hat b_{Lasso;j}$. Thus, $\hat b_{Lasso;j}$ is asymptotically
normal, with the variance of the best possible unbiased estimator. Note that
any regular estimator (regular at least on parametric sub-models) must be
asymptotically unbiased.
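
The link between the Gauss-Markov variance and the limit $\Sigma^{-1}$ of $\hat\Omega_n$ in Theorem 2.2 is the standard identity $1/\mathrm{Var}(X_j - X_{-j}\gamma_j^0) = \Theta_{jj}$ for $\Theta = \Sigma^{-1}$. The following sketch verifies it numerically at the population level; it assumes NumPy, and the AR(1)-type covariance and the helper name `projection_variance` are illustrative choices of ours.

```python
import numpy as np

def projection_variance(Sigma, j):
    """Population Var(X_j - X_{-j} gamma_j^0) for a covariance matrix Sigma."""
    rest = [k for k in range(Sigma.shape[0]) if k != j]
    gamma0 = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
    return Sigma[j, j] - Sigma[j, rest] @ gamma0

# Illustrative covariance with entries rho^|j-k| (an assumption for this example).
p, rho = 6, 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Theta = np.linalg.inv(Sigma)
tau2 = np.array([projection_variance(Sigma, j) for j in range(p)])
# sigma_eps^2 / tau2[j] is the Gauss-Markov variance; it equals sigma_eps^2 * Theta[j, j].
```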

The main difference between this and most of the other papers on complex
models is that usually the Lasso is considered as solving a nonparametric
model with parameter whose dimension p is increasing to infinity, while we
consider the problem as a semiparametric model in which we concentrate on
a low-dimensional parameter of interest, e.g., $\beta_j^0$, while the rest of the parameters,
$\beta_{-j}^0$, are considered as nuisance parameters. In the rest of this discussion
we try to put the model in a standard semiparametric framework in which
there is an infinite-dimensional population model.

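Without loss of generality, the parameter of interest is $\beta_1^0$, i.e., the first
component. Consider the random design model

(13) $Y = \beta_1^0 X_1 + K(Z) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$,

where Z is a Gaussian process. Suppose that with sample size n, we observe
a sample from $Y, X_1^n, \ldots, X_{p_n}^n$ such that

(14) $Y = \sum_{j=1}^{p_n} \beta_j^n X_j^n + \varepsilon^n$, $\varepsilon^n$ independent of $X_1^n, \ldots, X_{p_n}^n$,

[the remaining displayed conditions of (14), which involve sums over $j = 2, \ldots, p_n$, are illegible in the source]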

Theorem 2.3. Suppose (14) and the conditions of Theorem 2.2 are satisfied,
then

$\hat b_{Lasso;1} = \beta_1^0 + \frac{1}{n}\sum_{i=1}^n \frac{(X_{i,1} - E[X_{i,1}|Z_i])\,\varepsilon_i}{E[(X_1 - E[X_1|Z])^2]} + o_P(n^{-1/2})$.

In particular, the limiting variance of $\sqrt{n}(\hat b_{Lasso;1} - \beta_1^0)$ reaches the information
bound $\sigma_\varepsilon^2/E[(X_1 - E[X_1|Z])^2]$. Furthermore, $\hat b_{Lasso;1}$ is regular at the
one-dimensional parametric sub-model with component $\beta_1^0$ and hence, $\hat b_{Lasso;1}$
is asymptotically efficient for estimating $\beta_1^0$.



A proof is given in Section 6.1.

As a concrete example, condition (14) and the conditions of Theorem 2.2
are satisfied when:

(15) $Y = \sum_{j=1}^{\infty} \beta_j^0 X_j + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $\beta^0 \in \mathcal{B}$, $(X_j)_{j \in \mathbb{N}} \sim P_X \in \mathcal{P}$,
$X_j^n \equiv X_j$ $\forall j = 1, \ldots, p_n$,

where $\mathcal{B} = \{(\beta_j)_{j \in \mathbb{N}};\ \|\beta\|_0^0 < \infty\}$,

$\mathcal{P} = \{P_X$ Gaussian on $\mathbb{R}^\infty$; $0 < \min_S \Lambda_{\min}^2(\Sigma_{S,S})$, $\max_{j,k} |\Sigma_{jk}| < \infty$,
and $\forall j$, $E[X_j] = 0$, $\|\gamma_j(P_X)\|_0 < \infty\}$.

Thereby, $\Sigma_{S,S}$ is the covariance matrix of $\{X_j;\ j \in S\}$, and $\Lambda_{\min}^2(\cdot)$ denotes
the minimal eigenvalue. Furthermore, $\gamma_j(P_X) = \arg\min_\gamma E_{P_X}[(X_j -
\sum_{k \neq j} \gamma_k X_k)^2]$ are the coefficients from the projection of $X_j$ on all other
infinitely many variables $\{X_k;\ k \neq j\}$. The assumption about the covariances
is equivalent to saying that $(X_j)_{j \in \mathbb{N}}$ has a positive definite covariance function.
A proof that this example fulfills the required assumptions is given in
Section 6.1.

2.3.4. Non-Gaussian design. We extend here Theorem 2.2 to allow for
non-Gaussian designs. Besides covering a broader range of designs for linear
models, the result is important for the treatment of generalized linear models
in Section 3.

Consider a random design matrix X with i.i.d. rows having mean zero and
population covariance $\Sigma$ with its inverse (assumed to exist) $\Theta = \Sigma^{-1}$. Denote
by $\gamma_j = \arg\min_\gamma E[\|X_j - X_{-j}\gamma\|_2^2]$ (which was denoted by $\gamma_j(P_X)$ in the
previous subsection). Define the error $\eta_j := X_j - X_{-j}\gamma_j$ with variance $\tau_j^2 =
E[\|\eta_j\|_2^2/n] = 1/\Theta_{jj}$. We make the following assumptions.

(B1) The design X has either i.i.d. sub-Gaussian rows or i.i.d. rows and
for some $K \ge 1$, $\|X\|_\infty = \max_{i,j} |X_{i,j}| = O(K)$. The latter we call the
bounded case. The strongly bounded case assumes in addition that
$\|X_{-j}\gamma_j\|_\infty = O(K)$. We write $K_0 = 1$ in the sub-Gaussian case and
$K_0 = K$ in the (strongly) bounded case (where $K_0$ appears in some of
the conditions below).

(B2) It holds that $K_0 s_j\sqrt{\log(p)/n} = o(1)$. In the sub-Gaussian case we
relax this to $\sqrt{s_j\log(p)/n} = o(1)$.

(B3) The smallest eigenvalue of $\Sigma$ satisfies $0 < L \le \Lambda_{\min}^2$ and $\|\Sigma\|_\infty = O(1)$.

(B4) It holds that $E\eta_{1,j}^4 = O(K_0^4)$.

We note that the strongly bounded case in (B1) follows from the bounded
case if $\|\gamma_j\|_1 = O(1)$, and the sub-Gaussian assumption is stronger. Furthermore,
in the sub-Gaussian case or the strongly bounded case, the assumption
(B4) follows automatically. Assumption (B2) is a standard sparsity assumption
for $\Theta$. Finally, assumption (B3) implies that $\|\Theta_j\|_2 \le \Lambda_{\min}^{-2} =
O(1)$ uniformly in j (see (34)), so that in particular $\tau_j^2 = 1/\Theta_{jj}$ stays away
from zero. Note that (B3) also implies $\tau_j^2 \le \Sigma_{jj} = O(1)$ uniformly in j.

Theorem 2.4. Suppose the conditions (B1)-(B4) hold. Denote by $\hat\Theta$ and $\hat\tau_j^2$
the estimates from the nodewise Lasso in (8). Then for $\lambda_j \asymp K_0\sqrt{\log(p)/n}$
we have:

$\|\hat\Theta_j - \Theta_j\|_1 = O_P(K_0 s_j\sqrt{\log(p)/n})$, $\|\hat\Theta_j - \Theta_j\|_2 = O_P(K_0\sqrt{s_j\log(p)/n})$.

Furthermore,

$|\hat\Theta_j^T\Sigma\hat\Theta_j - \Theta_{jj}| \le \|\Sigma\|_\infty\|\hat\Theta_j - \Theta_j\|_1^2 \wedge \Lambda_{\max}^2\|\hat\Theta_j - \Theta_j\|_2^2 + 2|\hat\tau_j^2 - \tau_j^2|$,

where $\Lambda_{\max}^2$ is the maximal eigenvalue of $\Sigma$. If the conditions hold uniformly
in j then in the sub-Gaussian or strongly bounded case the results are also
uniform in j.

Finally, for the sub-Gaussian or strongly bounded case, if the conditions
hold uniformly in j, $K_0 s_{\max}\sqrt{\log(p)/n} = o(1)$ and $s_0 = o(n^{1/2}/\log(p))$, the
statements of Theorem 2.2 hold.

A proof is given in Section 5.6. 

2.4. Linear projection estimator and bias correction with the Lasso. We
discuss here a relation to the method in [37]. The estimator in (5) has an
(almost) equivalent representation. Consider any $n \times 1$ score vector $Z_j$, for
$j = 1, \ldots, p$. Pursuing a linear projection of Y onto $Z_j$ we obtain:

$Z_j^T Y = Z_j^T X_j\beta_j^0 + \sum_{k \neq j} Z_j^T X_k\beta_k^0 + Z_j^T\varepsilon$.

Therefore,

$\frac{Z_j^T Y}{Z_j^T X_j} - \beta_j^0 = \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\beta_k^0 + \frac{Z_j^T\varepsilon}{Z_j^T X_j}$

with a bias term $\sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\beta_k^0$. When estimating the bias with the Lasso,
we obtain the following estimator:

(16) $\hat b_{proj;j} = \frac{Z_j^T Y}{Z_j^T X_j} - \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\hat\beta_k$.
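Similarly as in (4), we can derive an approximate pivot (for fixed design):

$\sqrt{n}(\hat b_{proj} - \beta^0) = W - \sqrt{n}\Delta_{proj}$,

$W \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\Omega_{proj})$, $(\Omega_{proj})_{jk} = \frac{n\,Z_j^T Z_k}{(Z_j^T X_j)(Z_k^T X_k)}$,

(17) $\Delta_{proj;j} = \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}(\hat\beta_k - \beta_k^0)$ ($j = 1, \ldots, p$).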

A typical choice of a score vector is $Z_j = X_j - X_{-j}\hat\gamma_j$ where $\hat\gamma_j$ is an estimated
vector of coefficients when regressing $X_j$ versus $X_{-j}$: a prominent example
is the nodewise Lasso in (7) in Section 2.1.1:

(18) $Z_{Lasso;j} = X_j - X_{-j}\hat\gamma_{Lasso;j}$.

Another choice is using $\hat\Theta$ from Section 2.1:

(19) $Z = X\hat\Theta^T$,

whose jth column vectors build the $n \times 1$ score vectors $Z_j$ ($j = 1, \ldots, p$).

The estimators in (5) and (16) are equivalent whenever the vectors $Z_j$ in
(19) and in (16) coincide, and if $Z_j^T X_j/n = 1$ for all j. The latter is true
when enforcing, for each j,

(20) $(\hat\Theta\hat\Sigma)_{jj} = 1$

for the estimator in (5); for (16), we can always re-scale $Z_j$ such that
$Z_j^T X_j/n = 1$ (by re-scaling $Z_j \leftarrow Z_j/(Z_j^T X_j/n)$). Thus, the main condition
to make the estimators equal is (20), and it holds for the nodewise
Lasso as shown in (10).
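
The stated equivalence is a purely algebraic fact and can be checked numerically for any initial estimator and any approximate inverse satisfying (20). The sketch below assumes NumPy; the ridge-type approximate inverse, the random stand-in for $\hat\beta$ and all variable names are hypothetical choices made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
beta_hat = rng.standard_normal(p)           # stand-in for a Lasso fit
Sigma_hat = X.T @ X / n

# Some approximate inverse, rescaled row-wise so that (20) holds.
Theta = np.linalg.inv(Sigma_hat + 0.5 * np.eye(p))
Theta = Theta / np.diag(Theta @ Sigma_hat)[:, None]

b = beta_hat + Theta @ (X.T @ (y - X @ beta_hat)) / n    # estimator (5)

Z = X @ Theta.T                                           # score vectors (19)
b_proj = np.empty(p)
for j in range(p):
    zx = Z[:, j] @ X                                      # Z_j^T X_k for all k
    bias = sum(zx[k] * beta_hat[k] for k in range(p) if k != j)
    b_proj[j] = (Z[:, j] @ y - bias) / zx[j]              # estimator (16)
```

Both vectors agree up to floating point error, because under (20) each component of (16) simplifies to $\hat\beta_j + Z_j^T(Y - X\hat\beta)/n$.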




2.5. Confidence intervals and hypothesis testing. We assume in this section
an estimator $\hat b$ which satisfies:

(21) $\sqrt{n}(\hat b - \beta^0) = W_n + \Delta_n$, $\|\Delta_n\|_\infty = o_P(1)$, $W_n | X \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\hat\Omega_n)$.

This holds for $\hat b_{Lasso}$ in (9) assuming sparsity and design conditions as discussed
in Theorem 2.2.

Confidence intervals and statistical hypothesis tests, when conditioning on
X, can be immediately derived from such an approximate pivot. For
one-dimensional parameters $\beta_j^0$, the two-sided confidence interval and statistical
test for $H_{0,j}$: $\beta_j^0 = 0$ are given by

$I_j = [\hat b_j - \Phi^{-1}(1 - \alpha/2)\sigma_\varepsilon\sqrt{\hat\Omega_{n;jj}/n},\ \hat b_j + \Phi^{-1}(1 - \alpha/2)\sigma_\varepsilon\sqrt{\hat\Omega_{n;jj}/n}]$

and the p-value

(22) $P_j = 2\left(1 - \Phi\left(\frac{\sqrt{n}\,|\hat b_{j;observ}|}{\sigma_\varepsilon\sqrt{\hat\Omega_{n;jj}}}\right)\right)$,

where $\hat b_{j;observ}$ denotes the observed value of the statistic $\hat b_j$.
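
For concreteness, the p-value (22) needs nothing beyond the standard normal c.d.f. The following sketch uses only the Python standard library; the function name `pointwise_p_value` and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def pointwise_p_value(b_j, omega_jj, sigma_eps, n):
    """Two-sided p-value (22) for H_{0,j}: beta_j^0 = 0."""
    z = math.sqrt(n) * abs(b_j) / (sigma_eps * math.sqrt(omega_jj))
    return 2.0 * (1.0 - NormalDist().cdf(z))

# Illustrative inputs: observed b_j = 0.3, Omega_jj = 1.2, sigma_eps = 1, n = 100.
p_j = pointwise_p_value(0.3, 1.2, 1.0, 100)
```

In practice `sigma_eps` would be replaced by an estimate, as discussed in Section 2.5.1.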

For simultaneous inference, we focus on testing $H_{0,G}$: $\beta_j^0 = 0$ for all $j \in G$
versus the alternative $H_{A,G}$: $\beta_j^0 \neq 0$ for at least one $j \in G$, where $G \subseteq
\{1, \ldots, p\}$ is an arbitrary set. As a concrete test-statistic, consider

$T_G = \max_{j \in G} \sqrt{n}\,|\hat b_j|/\sigma_\varepsilon$

whose distribution under $H_{0,G}$ can be approximated by $T_{W,G} = \max_{j \in G} |W_j|/\sigma_\varepsilon$.
Its distribution $J_G(c) = P[\max_{j \in G} |W_j|/\sigma_\varepsilon \le c]$ can be easily simulated by
generating Gaussian variables $W/\sigma_\varepsilon \sim \mathcal{N}_{|G|}(0, \hat\Omega_{n;G,G})$, which do not involve
$\sigma_\varepsilon$. Denote by $\gamma_{G;observ} = \max_{j \in G} \sqrt{n}\,|\hat b_{j;observ}/\sigma_\varepsilon|$. Then, the p-value for
$H_{0,G}$, against the alternative being the complement $H_{0,G}^c$, is defined as

(23) $P_G = 1 - J_G(\gamma_{G;observ})$.
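
The simulation of $J_G$ can be sketched as a plain Monte Carlo approximation. The code below assumes NumPy; the function name `group_p_value`, the number of draws and the seed are hypothetical, and the scaling follows (21), i.e. the simulated variables are draws of $W_G/\sigma_\varepsilon$ with covariance $\hat\Omega_{n;G,G}$.

```python
import numpy as np

def group_p_value(b, Omega, sigma_eps, n, G, n_sim=20000, seed=0):
    """Monte Carlo approximation of (23) for H_{0,G}: beta_j^0 = 0 for all j in G."""
    G = list(G)
    rng = np.random.default_rng(seed)
    # Draws of W_G / sigma_eps ~ N(0, Omega_{G,G}); sigma_eps is not needed here.
    V = rng.multivariate_normal(np.zeros(len(G)), Omega[np.ix_(G, G)], size=n_sim)
    t_sim = np.abs(V).max(axis=1)                            # simulated max statistic
    t_obs = np.sqrt(n) * np.abs(b[G]).max() / sigma_eps      # observed statistic
    return float(np.mean(t_sim > t_obs))                     # estimate of 1 - J_G(t_obs)

Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
b_obs = np.array([0.05, -0.02, 0.30])
p_G = group_p_value(b_obs, Omega, sigma_eps=1.0, n=100, G=[0, 1, 2])
```

For a group consisting of a single index, the result agrees with (22) up to Monte Carlo error.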

2.5.1. Estimation of $\sigma_\varepsilon$. For the construction of confidence intervals and
hypothesis tests, we need an estimate for $\sigma_\varepsilon$. The scaled Lasso [28] yields
a consistent estimator for this quantity, under the assumptions made for
Theorem 2.2. We then simply plug in an estimate $\hat\sigma_\varepsilon$ into (22) or into $\gamma_G$ for
(23).

2.5.2. Multiple testing adjustment. Based on many single p-values, we can
use standard procedures for multiple testing adjustment to control for various
type I error measures. The representation from Theorem 2.1 or 2.2 with
$\|\Delta\|_\infty$ being sufficiently small allows us to construct a multiple testing adjustment
which takes the dependence in terms of the covariance $\hat\Omega_n$ (see Theorem
2.2) into account: the exact procedure is described in [4]. Especially when
having strong dependence among the p-values, the method is much less conservative
than the Bonferroni-Holm procedure for strongly controlling the
familywise error rate.

3. Generalized linear models and general convex loss functions. 

We show here that the idea of de-sparsifying $\ell_1$-norm penalized estimators
and corresponding theory from Section 2 carries over to models with convex
loss functions such as generalized linear models (GLMs).

3.1. The setting and de-sparsifying the $\ell_1$-norm regularized estimator. We
consider the following framework with $1 \times p$ vectors of covariables $x_i \in \mathcal{X} \subseteq
\mathbb{R}^p$ and univariate responses $y_i \in \mathcal{Y} \subseteq \mathbb{R}$ for $i = 1, \ldots, n$. As before, we
denote by X the design matrix with ith row equal to $x_i$. At the moment, we
do not distinguish whether X is random or fixed (e.g. when conditioning on
X).

For $y \in \mathcal{Y}$ and $x \in \mathcal{X}$ being a $1 \times p$ vector, we have a loss function

$\rho_\beta(y, x) = \rho(y, x\beta)$ ($\beta \in \mathbb{R}^p$)

which is assumed to be a strictly convex function in $\beta \in \mathbb{R}^p$. We now define

$\dot\rho_\beta := \frac{\partial}{\partial\beta}\rho_\beta$, $\ddot\rho_\beta := \frac{\partial}{\partial\beta}\frac{\partial}{\partial\beta^T}\rho_\beta$,

where we implicitly assume that the derivatives exist. For a function $g:
\mathcal{Y} \times \mathcal{X} \to \mathbb{R}$, we write $P_n g := \sum_{i=1}^n g(y_i, x_i)/n$ and $Pg := E P_n g$. Moreover,
we let $\|g\|_n^2 := P_n g^2$ and $\|g\|^2 := Pg^2$.

The $\ell_1$-norm regularized estimator is

(24) $\hat\beta = \arg\min_\beta (P_n\rho_\beta + \lambda\|\beta\|_1)$.

As in Section 2.1, we de-sparsify the estimator. For this purpose, define

(25) $\hat\Sigma := P_n\ddot\rho_{\hat\beta}$.

Note that in general, $\hat\Sigma$ depends on $\hat\beta$ (an exception being the squared error
loss). We construct $\hat\Theta = \hat\Theta_{Lasso}$ by doing a nodewise Lasso with $\hat\Sigma$ as input
as detailed below in (29). We then define

(26) $\hat b := \hat\beta - \hat\Theta P_n\dot\rho_{\hat\beta}$.

The estimator in (5) is a special case of (26) with squared error loss.
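
To make (25) and (26) concrete, the sketch below computes $P_n\dot\rho_{\hat\beta}$ and $\hat\Sigma$ for two losses written as functions of the linear predictor $a = x\beta$. It assumes NumPy. The logistic loss is our illustrative choice of a GLM, the squared error loss is scaled as $(y - a)^2/2$ so that (26) matches (5) exactly, and all function names and the toy data are hypothetical. An exact inverse is used for $\hat\Theta$ only because the toy example has $p < n$.

```python
import numpy as np

def score_and_hessian(X, y, beta_hat, rho_dot, rho_ddot):
    """Return P_n rho_dot (a p-vector) and Sigma_hat = P_n rho_ddot (p x p) at beta_hat."""
    n = X.shape[0]
    a = X @ beta_hat                                    # linear predictors x_i beta_hat
    score = X.T @ rho_dot(y, a) / n
    Sigma_hat = (X * rho_ddot(y, a)[:, None]).T @ X / n
    return score, Sigma_hat

def desparsify(beta_hat, Theta_hat, score):
    return beta_hat - Theta_hat @ score                 # estimator (26)

# Squared error loss rho(y, a) = (y - a)^2 / 2.
def sq_dot(y, a):
    return a - y
def sq_ddot(y, a):
    return np.ones_like(a)

# Logistic loss rho(y, a) = -y * a + log(1 + exp(a)) for y in {0, 1}.
def logit_dot(y, a):
    return 1.0 / (1.0 + np.exp(-a)) - y
def logit_ddot(y, a):
    mu = 1.0 / (1.0 + np.exp(-a))
    return mu * (1.0 - mu)

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.standard_normal((n, p))
beta_hat = np.array([0.5, 0.0, -0.3, 0.0])              # stand-in for the estimator (24)
y_bin = (rng.random(n) < 0.5).astype(float)
score, Sigma_hat = score_and_hessian(X, y_bin, beta_hat, logit_dot, logit_ddot)
b = desparsify(beta_hat, np.linalg.inv(Sigma_hat), score)   # exact inverse, p < n only
```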

3.1.1. Lasso for nodewise regression with matrix input. Denote by $\hat\Sigma$ a matrix
which we want to approximately invert using the nodewise Lasso. For
every row j, we consider the optimization

(27) $\hat\gamma_j = \arg\min_\gamma \left( (\hat\Sigma_{jj} - 2\hat\Sigma_{j,\backslash j}\gamma + \gamma^T\hat\Sigma_{\backslash j,\backslash j}\gamma)/2 + \lambda_j\|\gamma\|_1 \right)$,

where $\hat\Sigma_{j,\backslash j}$ denotes the jth row of $\hat\Sigma$ without the diagonal element (j, j),
and $\hat\Sigma_{\backslash j,\backslash j}$ is the submatrix without the jth row and jth column. We note
that for the case where $\hat\Sigma = X^T X/n$, $\hat\gamma_j$ is the same as in (7).

Based on $\hat\gamma_j$ from (27), we compute

(28) $\hat\tau_j^2 = \hat\Sigma_{jj} - \hat\Sigma_{j,\backslash j}\hat\gamma_j$.

Having $\hat\gamma_j$ and $\hat\tau_j^2$ from (27) and (28), we define the nodewise Lasso as

(29) $\hat\Theta_{Lasso}$ as in (8) using (27)-(28) from matrix input $\hat\Sigma$ in (25).

Moreover, we denote by

$\hat b_{Lasso}$ = $\hat b$ from (26) using the nodewise Lasso from (29).

Computation of (27) and hence of $\hat\Theta$ can be done efficiently via coordinate
descent using the KKT conditions to characterize the zeroes. Furthermore,
an active set strategy leads to additional speed-up. See for example [13] and
[20].
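
A plain (non-optimized) version of such a coordinate descent can be sketched as follows. It assumes NumPy and writes the objective of row j as $(\hat\Sigma_{jj} - 2\hat\Sigma_{j,\backslash j}\gamma + \gamma^T\hat\Sigma_{\backslash j,\backslash j}\gamma)/2 + \lambda\|\gamma\|_1$, the normalization that matches (7); the function name `nodewise_from_matrix`, the fixed number of sweeps and the common `lam` for all rows are our assumptions. No active set strategy is used.

```python
import numpy as np

def nodewise_from_matrix(Sigma_hat, lam, n_sweeps=200):
    """Nodewise Lasso with a symmetric matrix input, rows built as in (8) and (28)."""
    p = Sigma_hat.shape[0]
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        S = Sigma_hat[np.ix_(rest, rest)]         # Sigma_hat without row/column j
        s = Sigma_hat[j, rest]                    # row j without the diagonal element
        g = np.zeros(p - 1)
        for _ in range(n_sweeps):
            for k in range(p - 1):
                z = s[k] - S[k] @ g + S[k, k] * g[k]             # leave coordinate k out
                g[k] = np.sign(z) * max(abs(z) - lam, 0.0) / S[k, k]
        tau2 = Sigma_hat[j, j] - s @ g            # (28)
        Theta[j, j] = 1.0 / tau2
        Theta[j, rest] = -g / tau2
    return Theta

rng = np.random.default_rng(0)
n, p, lam = 50, 8, 0.2
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
Theta = nodewise_from_matrix(Sigma_hat, lam)
```

With `lam = 0` and a positive definite input the routine reduces to Gauss-Seidel iterations and recovers the exact inverse, which is a convenient way to test an implementation.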

3.2. Theoretical results. We show here that the components of the estimator
$\hat b$ in (26), when normalized with the easily computable standard error,
converge to a standard Gaussian distribution. Based on such a result, the
construction of confidence intervals and tests is straightforward.

Let $\beta^0 \in \mathbb{R}^p$ be the unique minimizer of $P\rho_\beta$ with $s_0$ denoting the number
of non-zero coefficients. We use analogous notation as in Section 2.3
but with modifications for the current context. The asymptotic framework,
which allows for Gaussian approximation of averages, is as in Section 2.3
for $p = p_n \ge n \to \infty$ and thus, $(y_1, \ldots, y_n) = Y_n$, $X = X_n$, $\beta^0 = \beta_n^0$ and
underlying parameters are all (potentially) depending on n. As before, we
usually suppress the corresponding index n.

We make the following assumptions which are discussed in Section 3.3.1.
Thereby, we assume (C3), (C5), (C6) and (C8) for some constant $K \ge 1$
and furthermore, $\lambda$, $\lambda_*$ and $s_*$ are positive constants.

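(C1) The derivatives

$\dot\rho(y, a) := \frac{d}{da}\rho(y, a)$, $\ddot\rho(y, a) := \frac{d^2}{da^2}\rho(y, a)$,

exist for all y, a, and for some $\delta$-neighborhood ($\delta > 0$), $\ddot\rho(y, a)$ is Lipschitz:

$\max_{a_0 \in \{x_i\beta^0\}} \sup_{|a - a_0| \vee |\hat a - a_0| \le \delta} \sup_{y \in \mathcal{Y}} \frac{|\ddot\rho(y, a) - \ddot\rho(y, \hat a)|}{|a - \hat a|} \le 1$.

Moreover

$\max_{a_0 \in \{x_i\beta^0\}} \sup_{|a - a_0| \le \delta} \sup_{y \in \mathcal{Y}} |\ddot\rho(y, a)| = O(1)$.

(C2) It holds that $\|\hat\beta - \beta^0\|_1 = O_P(s_0\lambda)$, $\|X(\hat\beta - \beta^0)\|^2 = O_P(s_0\lambda^2)$, and
$\|X(\hat\beta - \beta^0)\|_n^2 = O_P(s_0\lambda^2)$.

(C3) It holds that $\|X\|_\infty = \max_{i,j} |X_{i,j}| = O(K)$.

(C4) It holds that $\|P_n\ddot\rho_{\hat\beta}\hat\Theta_j - e_j\|_\infty = O_P(\lambda_*)$.

(C5) It holds that $\|X\hat\Theta_j\|_\infty = O_P(K)$ and $\|\hat\Theta_j\|_1 = O_P(\sqrt{s_*})$.

(C6) It holds that $\|(P_n - P)\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\|_\infty = O_P(K^2\lambda)$.

(C7) For every j, the random variable

$\frac{\sqrt{n}\,(\hat\Theta P_n\dot\rho_{\beta^0})_j}{\sqrt{(\hat\Theta P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{jj}}}$

converges weakly to a $\mathcal{N}(0, 1)$-distribution.

(C8) It holds that

$Ks_0\lambda^2 = o(n^{-1/2})$, $\lambda_*\lambda s_0 = o(n^{-1/2})$, and $K^2 s_*\lambda + K^2\sqrt{s_0}\lambda = o(1)$.

We note that often the regularization parameters in (27) are the same and $\lambda_*$
can be chosen as $\lambda_* = \lambda_{\max} \equiv \lambda_j$, see also Section 3.3.1. Furthermore, when
assuming that a population $\Theta = \Sigma^{-1}$ exists for $\Sigma = P\ddot\rho_{\beta^0}$, $s_*$ can be chosen
as $s_{\max}$ which is the maximal row sparsity of $\Theta$. The following main result
holds for fixed or random design according to whether the assumptions hold
for one or the other case.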

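Theorem 3.1. Assume (C1)-(C8). For the estimator in (26), we have for
each $j \in \{1, \ldots, p\}$:
\[ \sqrt{n}\,(\hat b_j - \beta_j^0)/\hat\sigma_j = W_j + o_P(1), \]
where $W_j \sim \mathcal{N}(0,1)$ and $\hat\sigma_j^2 = (\hat\Theta P_n\dot\rho_{\hat\beta}\dot\rho_{\hat\beta}^T\hat\Theta^T)_{j,j}$.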

A proof is given in Section 5. Assumption (C1) of Theorem 3.1 means that
we fall back on the classical conditions for asymptotic normality in the
one-dimensional case, as in, for example, [9]. Assumption (C8) is a sparsity
assumption: for $K = O(1)$ and choosing $\lambda_* (= \lambda_{\max}) \asymp \lambda \asymp \sqrt{\log(p)/n}$, the
condition reads as $s_0 = o(\sqrt{n}/\log(p))$ (as in Theorem 2.2) and
$s_* (= s_{\max}) = o(\sqrt{n/\log(p)})$. All the other assumptions (C2)-(C7) follow essentially from
the conditions of Corollary 3.1 presented later, with the exception that (C3)
is straightforward to understand. For more details see Section 3.3.1.
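
The normal approximation in Theorem 3.1 translates directly into a Wald-type interval and a two-sided test for a single coefficient. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and the inputs `b_hat_j` and `sigma_hat_j` are assumed to have been computed elsewhere (from the estimator in (26) and the variance formula of the theorem).

```python
from statistics import NormalDist


def desparsified_ci(b_hat_j, sigma_hat_j, n, level=0.95):
    """Two-sided interval b_hat_j +/- z * sigma_hat_j / sqrt(n).

    Relies on sqrt(n) * (b_hat_j - beta_j^0) / sigma_hat_j being
    approximately N(0, 1), as stated in Theorem 3.1.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma_hat_j / n ** 0.5
    return b_hat_j - half_width, b_hat_j + half_width


def two_sided_p_value(b_hat_j, sigma_hat_j, n, beta_null=0.0):
    """P-value for H0: beta_j^0 = beta_null based on the N(0, 1) limit."""
    z = n ** 0.5 * (b_hat_j - beta_null) / sigma_hat_j
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Illustrative numbers only (not results from the paper).
lo, hi = desparsified_ci(b_hat_j=0.5, sigma_hat_j=1.0, n=100)
p_val = two_sided_p_value(b_hat_j=0.5, sigma_hat_j=1.0, n=100)
```

The interval is only as good as the asymptotic approximation, so the sparsity condition (C8) matters in practice as well.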

3.3. About nodewise regression with certain random matrices. We justify
in this section most of the assumptions for Theorem 3.1 when using the
nodewise Lasso estimator $\hat\Theta = \hat\Theta_{\rm Lasso}$ as in (29) and when the matrix input is
parameterized by $\hat\beta$ as for standard generalized linear models. For notational
simplicity, we drop the subscript "Lasso" in $\hat\Theta$. Let $w_\beta$ be an $n$-vector with
entries $w_{i,\beta} = w_\beta(y_i, x_i)$. We consider the matrix $X_\beta := W_\beta X$, where $W_\beta =
{\rm diag}(w_\beta)$. We define $\hat\Sigma_\beta := X_\beta^T X_\beta/n$. We consider $\hat\Theta_{\hat\beta,j}$ as the $j$th row of the
nodewise regression $\hat\Theta = \hat\Theta_{\hat\beta}$ in (29) based on the matrix input $X_{\hat\beta}$.
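
To make the weighted matrix input concrete, here is a small sketch for one particular GLM. It assumes the logistic loss, for which the second derivative of the loss is $\mu(1-\mu)$ with $\mu = 1/(1+e^{-x\beta})$, and it assumes the weights are taken as the square root of that second derivative so that the weighted Gram matrix equals the empirical Hessian. Both choices, and all function names, are illustrative assumptions rather than definitions taken from the text.

```python
import math


def logistic_weights(X, beta):
    """w_i = sqrt(mu_i * (1 - mu_i)) with mu_i = sigmoid(x_i beta).

    For the logistic loss the second derivative does not depend on y.
    """
    weights = []
    for row in X:
        a = sum(x * b for x, b in zip(row, beta))
        mu = 1.0 / (1.0 + math.exp(-a))
        weights.append(math.sqrt(mu * (1.0 - mu)))
    return weights


def weighted_gram(X, weights):
    """Sigma_hat_beta = X_beta^T X_beta / n with X_beta = diag(w) X."""
    n, p = len(X), len(X[0])
    Xw = [[w * x for x in row] for w, row in zip(weights, X)]
    return [[sum(Xw[i][a] * Xw[i][b] for i in range(n)) / n
             for b in range(p)] for a in range(p)]


# Tiny illustrative design; at beta = 0 every weight equals 1/2.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.0]]
w0 = logistic_weights(X, [0.0, 0.0])
sigma_hat_0 = weighted_gram(X, w0)
```

In a full procedure the weights would be evaluated at the Lasso estimate rather than at zero, and the resulting matrix would be passed to the nodewise regression.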

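We let $\Sigma_\beta = E[X_\beta^T X_\beta/n]$ and define $\Theta := \Theta_{\beta^0} := \Sigma_{\beta^0}^{-1}$ (assumed to exist).

Let $s_j := s_{\beta^0,j} := \|\Theta_{\beta^0,j}\|_0$. Analogous to Section 2.3.4, we let $X_{\beta^0,-j}\gamma_{\beta^0,j}$
be the projection of $X_{\beta^0,j}$ on $X_{\beta^0,-j}$ using the inner products in the matrix
$\Sigma_{\beta^0}$ and let $\eta_{\beta^0,j} := X_{\beta^0,j} - X_{\beta^0,-j}\gamma_{\beta^0,j}$. We then make the following
assumptions.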

(D1) The pairs of random variables $\{(y_i, x_i)\}_{i=1}^n$ are i.i.d. and $\|X\|_\infty =
\max_{i,j}|X_{i,j}| = O(K)$ and $\|X_{\beta^0,-j}\gamma_{\beta^0,j}\|_\infty = O(K)$ for some $K \ge 1$.

(D2) It holds that $K^2 s_j\sqrt{\log(p)/n} = o(1)$.

(D3) The smallest eigenvalue of $\Sigma_{\beta^0}$ is bounded away from zero and moreover
$\|\Sigma_{\beta^0}\|_\infty = O(1)$. (The latter is ensured by requiring that the
largest eigenvalue is bounded from above.)

(D4) For some $\delta > 0$ and all $\|\beta - \beta^0\|_1 \le \delta$, it holds that $w_\beta$ stays away
from zero and that $\|w_\beta\|_\infty = O(1)$. We further require that for all such
$\beta$ and all $x$ and $y$
\[ |w_\beta(y,x) - w_{\beta^0}(y,x)| \le |x(\beta - \beta^0)|. \]

(D5) It holds that
\[ \|X(\hat\beta - \beta^0)\|_n = O_P(\lambda\sqrt{s_0}), \qquad \|\hat\beta - \beta^0\|_1 = O_P(\lambda s_0). \]

Note that (D5) typically holds when $\lambda\sqrt{s_0} = o(1)$ with $\lambda \asymp \sqrt{\log(p)/n}$
since the compatibility condition is then inherited from (D3). We have the
following result.

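Theorem 3.2. Assume the conditions (D1)-(D5). Then, using $\lambda_j \asymp
K\sqrt{\log(p)/n}$ for the nodewise Lasso $\hat\Theta_{\hat\beta,j}$:
\[ \|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_1 = O_P\big(K s_j\sqrt{\log(p)/n}\big) + O_P\Big(K^2 s_0\big((\lambda^2/\sqrt{\log(p)/n}) \vee \lambda\big)\Big), \]
\[ \|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_2 = O_P\big(K\sqrt{s_j\log(p)/n}\big) + O_P\big(K^2\sqrt{s_0}\,\lambda\big), \]
and
\[ |\hat\tau^2_{\hat\beta,j} - \tau^2_{\beta^0,j}| = O_P\big(K\sqrt{s_j\log(p)/n}\big) + O_P\big(K^2\sqrt{s_0}\,\lambda\big). \]
Moreover,
\[ |\hat\Theta_{\hat\beta,j}^T\Sigma_{\beta^0}\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j,j}| \le \big(\|\Sigma_{\beta^0}\|_\infty\|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_1^2\big) \wedge \big(\Lambda_{\max}^2\|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_2^2\big) + 2|\hat\tau^2_{\hat\beta,j} - \tau^2_{\beta^0,j}|, \]
where $\Lambda_{\max}^2$ is the maximal eigenvalue of $\Sigma_{\beta^0}$.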

A proof, using ideas for establishing Theorem 2.4, is given in Section 6.2. 
Corollary 3.1. Assume the conditions of Theorem 3.2, with $\lambda \asymp \sqrt{\log(p)/n}$,
$K \asymp 1$, $s_j = o(\sqrt{n}/\log(p))$ and $s_0 = o(\sqrt{n}/\log(p))$. Then
\[ \|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_1 = o_P\big(1/\sqrt{\log(p)}\big), \qquad \|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_2 = o_P(n^{-1/4}). \]

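Lemma 3.1. Assume the conditions of Corollary 3.1. Let for $i = 1, \ldots, n$,
$\xi_i$ be a real-valued random variable and $x_i^T \in \mathbb{R}^p$, and let $(x_i, \xi_i)_{i=1}^n$ be i.i.d.
Assume $E x_i^T\xi_i = 0$ and that $|\xi_i| \le 1$. Then
\[ \hat\Theta_{\hat\beta,j}^T\sum_{i=1}^n x_i^T\xi_i/n = \Theta_{\beta^0,j}^T\sum_{i=1}^n x_i^T\xi_i/n + o_P(n^{-1/2}). \]
Let $A := E x_i^T x_i\xi_i^2$ (assumed to exist). Assume that $\|A\Theta_{\beta^0,j}\|_\infty = O(1)$ and
that $1/(\Theta_{\beta^0,j}^T A\Theta_{\beta^0,j}) = O(1)$. Then
\[ \hat\Theta_{\hat\beta,j}^T A\hat\Theta_{\hat\beta,j} = \Theta_{\beta^0,j}^T A\Theta_{\beta^0,j} + o_P(1). \]
Moreover, then
\[ \frac{\hat\Theta_{\hat\beta,j}^T\sum_{i=1}^n x_i^T\xi_i/\sqrt{n}}{\sqrt{\hat\Theta_{\hat\beta,j}^T A\hat\Theta_{\hat\beta,j}}} \]
converges weakly to a $\mathcal{N}(0,1)$-distribution.

A proof is given in Section 6.2.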

3.3.1. Discussion of the assumptions for GLMs. Assumption (C1) is classical
[9] and (C3) is easy to understand. All the other assumptions (C2)-(C8)
follow essentially from the conditions of Corollary 3.1 with $\Sigma_\beta := P\ddot\rho_\beta$ and
$w_\beta^2(y,x) := \ddot\rho(y, x\beta)$, provided we take $\hat\Theta$ as the nodewise Lasso in (29),
and $s_* = s_j$ and $\lambda_* \asymp \lambda_j$. We will discuss this for the case $\|X\beta^0\|_\infty =
O(1)$ (that is, we assume $K = 1$ for simplicity) and $|\dot\rho(y, x\beta^0)| = O(1)$
uniformly in $x$ and $y$. We also need to assume $1/(\Theta_{\beta^0,j}^T P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\Theta_{\beta^0,j}) =
O(1)$. Note that for the case of canonical loss, $P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T = \Sigma_{\beta^0}$, and hence
$1/(\Theta_{\beta^0,j}^T P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\Theta_{\beta^0,j}) = \tau^2_{\beta^0,j}$.

Condition (C2) holds because the compatibility condition is met as $\Sigma_{\beta^0}$ is
non-singular and
\[ \|\hat\Sigma - \Sigma_{\beta^0}\|_\infty = O_P(\lambda_*). \]
The condition that $\dot\rho(y, x\beta^0)$ is bounded ensures that $\rho(y,a)$ is locally
Lipschitz, so that we can control the empirical process $(P_n - P)(\rho_{\hat\beta} - \rho_{\beta^0})$ as
in [33] (see also [5] or [32]). (In the case of a GLM with canonical loss (e.g.
least squares loss) we can relax the condition of a locally bounded derivative
because the empirical process is then linear.) Condition (C3) is assumed to
hold with $\|X\|_\infty = O(1)$, and Condition (C4) holds with $\lambda_* \asymp \sqrt{\log(p)/n}$.
This is because in the node-wise regression construction, the $1/\hat\tau_j^2$ are
consistent estimators of $(\Sigma_{\beta^0}^{-1})_{j,j}$ (see Theorem 3.2). Condition (C5) holds as well.
Indeed, $\|\Theta_{\beta^0,j}\|_1 = O(\sqrt{s_j})$, and $\|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_1 = O_P(\lambda_j s_j) = O_P(\sqrt{s_j})$.

Condition (C6) holds as well, since we assume that $|\dot\rho| = O(1)$ as well
as $\|X\|_\infty = O(1)$. As for Condition (C7), this follows from Lemma 3.1,
since $|\Theta_{\beta^0,j}^T\dot\rho_{\beta^0}(y,x)| = |\Theta_{\beta^0,j}^T x^T\dot\rho(y, x\beta^0)| = O(1)$, which implies for $A :=
P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T$ that $\|A\Theta_{\beta^0,j}\|_\infty = O(1)$.

4. Conclusions. We derive confidence regions and statistical tests for low- 
dimensional components of a large parameter in high-dimensional models. 
We propose a general principle which is based on "inverting" the KKT
conditions from $\ell_1$-penalized estimators. The method easily allows for multiple
testing adjustment which takes the dependence structure into account. 

For linear models, the procedure is (essentially) the same as the projection 
method in [37]: we prove its asymptotic optimality in terms of semipara- 
metric efficiency, assuming certain sparsity conditions. For generalized lin- 
ear models with convex loss functions, we develop a substantial body of new 
theory which in turn justifies the general "KKT inversion" principle. 

The conditions we impose seem rather tight for our method. We require
$\ell_0$-sparsity of the underlying regression coefficient $s_0 = o(\sqrt{n}/\log(p))$
(assuming bounded design for GLMs). This is essentially the condition for
$\ell_1$-norm convergence, and our method requires such an $\ell_1$-norm bound; the
additional factor $1/\sqrt{\log(p)}$ stems from the fact that the asymptotic pivot
$\hat b - \beta^0$ is normalized with $\sqrt{n}$. Regarding the design, we assume that the
minimal eigenvalue of the population covariance matrix $\Sigma$ is bounded from
below, and that the row-sparsity of $\Theta = \Sigma^{-1}$ satisfies $s_{\max} = o(n/\log(p))$
for Gaussian or sub-Gaussian design or $s_{\max} = o(\sqrt{n/\log(p)})$ for bounded
design, respectively. More generally, our method needs two elements: the
$\ell_1$-norm convergence of the Lasso estimator $\hat\beta$ and a bound on the $\ell_\infty$-norm of the $j$th
row of $\hat\Theta\hat\Sigma - I$, see (4) or (30). However, since the analysis is conditional on
$X$, the latter is an observable quantity, and hence its bound in practice does
not depend on the assumptions but is observed from data.
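
To get a feeling for how restrictive these sparsity requirements are, one can evaluate the three scales that appear in the conditions for a given sample size and dimension. The sketch below does only that arithmetic; the function name is hypothetical, and because the conditions are asymptotic ("little o") statements, the numbers indicate orders of magnitude rather than admissible sparsity levels.

```python
import math


def sparsity_scales(n, p):
    """Return the reference scales appearing in the sparsity conditions."""
    log_p = math.log(p)
    return {
        "sqrt(n)/log(p)": math.sqrt(n) / log_p,   # scale for s_0
        "sqrt(n/log(p))": math.sqrt(n / log_p),   # scale for s_max, bounded design
        "n/log(p)": n / log_p,                    # scale for s_max, (sub-)Gaussian design
    }


# Illustrative values of n and p, not taken from the paper.
scales = sparsity_scales(n=10_000, p=1_000)
```

For these illustrative values the three scales are roughly 14, 38 and 1448, which shows how much stronger the requirement on $s_0$ is than the one on $s_{\max}$ under Gaussian design.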

4.1. Empirical results. We have done some empirical validation for two-sided
testing of individual hypotheses $H_{0,j}: \beta_j^0 = 0$ and corresponding
multiple testing adjustment. Due to the length of this paper, we do not include
the results but rather give a short summary of the findings (and we intend
to provide an R-package with corresponding illustrations).

We compared our de-sparsified estimator $\hat b_{\rm Lasso}$ with a bias-corrected Ridge
estimator which has been proposed in [4]. As an overall conclusion, we find
that our $\hat b_{\rm Lasso}$ estimator has more power than the Ridge-type method while
still controlling type I error measures reasonably well, whereas the Ridge
procedure yields conservative type I error control for a broader class of
designs. We also considered the mean-squared error for estimating a single
parameter $\beta_j^0$, and we compared $\hat b_{\rm Lasso}$ with the standard Lasso and the
bias-corrected Ridge estimator [4]. We found that $\hat b_{\rm Lasso}$ is clearly better than the
Ridge-type method. For the standard Lasso, we observe a "super-efficiency"
phenomenon, namely that it often estimates the zero coefficients very
accurately while estimation for the non-zero parameters is poor in comparison
to $\hat b_{\rm Lasso}$.

5. Proofs and materials needed. 

5.1. Bounds for $\|\hat\beta - \beta^0\|_1$ with fixed design. The following known result
gives a bound for the $\ell_1$-norm estimation accuracy.

Lemma 5.1. Assume a linear model as in (1) with Gaussian error and
fixed design $X$ which satisfies the compatibility condition with compatibility
constant $\phi_0^2$ and with $\hat\Sigma_{j,j} \le M^2 < \infty$ for all $j$. Consider the Lasso with
regularization parameter $\lambda \ge 2M\sigma_\varepsilon\sqrt{(t^2 + 2\log(p))/n}$. Then, with probability at
least $1 - 2\exp(-t^2/2)$:
\[ \|\hat\beta - \beta^0\|_1 \le 8\lambda\frac{s_0}{\phi_0^2} \quad \text{and} \quad \|X(\hat\beta - \beta^0)\|_2^2/n \le 8\lambda^2\frac{s_0}{\phi_0^2}. \]

A proof follows directly from the arguments in [5, Th.6.1] which can be
modified to treat the case with unequal values of $\hat\Sigma_{j,j}$ for various $j$. □
5.2. Proof of Theorem 2.1. It is straightforward to see that
\[ \|\Delta\|_\infty = \|(\hat\Theta_{\rm Lasso}\hat\Sigma - I)(\hat\beta - \beta^0)\|_\infty \le \|\hat\Theta_{\rm Lasso}\hat\Sigma - I\|_\infty\,\|\hat\beta - \beta^0\|_1, \tag{30} \]
where $\|A\|_\infty = \max_{j,k}|A_{j,k}|$ is the element-wise sup-norm for a matrix $A$.

For bounding $\|\hat\Theta_{\rm Lasso}\hat\Sigma - I\|_\infty$ we invoke the KKT conditions for the Lasso
for nodewise regression in (7):
\[ \|X_{-j}^T(X_j - X_{-j}\hat\gamma_{{\rm Lasso},j})/n\|_\infty \le \lambda_j \quad (j = 1, \ldots, p), \]
and thus we obtain for $\hat\Theta_{\rm Lasso}$ from (8):
\[ \|\hat\Theta_{\rm Lasso}\hat\Sigma - I\|_\infty \le \|\Lambda\hat T^{-2}\|_\infty, \tag{31} \]
where $\Lambda = {\rm diag}(\lambda_1, \ldots, \lambda_p)$. The right-hand side of (31) can be bounded by
\[ \|\Lambda\hat T^{-2}\|_\infty \le \lambda_{\max}\|\hat T^{-2}\|_\infty. \]
Therefore, using the latter bound in (30) and the bound from Lemma 5.1
completes the proof. □
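
The KKT-based bound on $\|\hat\Theta_{\rm Lasso}\hat\Sigma - I\|_\infty$ can be checked numerically on simulated data. The sketch below assumes the usual form of the nodewise Lasso, namely that $\hat\gamma_j$ minimizes $\|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$, that $\hat\tau_j^2 = \|X_j - X_{-j}\hat\gamma_j\|_2^2/n + \lambda_j\|\hat\gamma_j\|_1$, and that row $j$ of $\hat\Theta$ is $(1, -\hat\gamma_j)/\hat\tau_j^2$ in the appropriate positions; equations (7) and (8) are not reproduced here, so these forms are an assumption. The coordinate-descent solver and all names are illustrative, not the authors' implementation.

```python
import random


def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for min_g ||y - X g||_2^2 / n + 2 * lam * ||g||_1."""
    n, q = len(X), len(X[0])
    g = [0.0] * q
    resid = list(y)  # residual y - X g, updated in place
    col_sq = [sum(X[i][k] ** 2 for i in range(n)) / n for k in range(q)]
    for _ in range(sweeps):
        for k in range(q):
            # inner product of column k with the residual that excludes g[k]
            rho = sum(X[i][k] * resid[i] for i in range(n)) / n + col_sq[k] * g[k]
            new_gk = soft_threshold(rho, lam) / col_sq[k]
            step = new_gk - g[k]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][k] * step
                g[k] = new_gk
    return g, resid


def nodewise_lasso(X, lam):
    """Return (Theta_hat, tau_sq) with one Lasso regression per column."""
    n, p = len(X), len(X[0])
    theta, tau_sq = [], []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        X_minus_j = [[row[k] for k in others] for row in X]
        x_j = [row[j] for row in X]
        gamma, resid = lasso_cd(X_minus_j, x_j, lam)
        t2 = sum(r * r for r in resid) / n + lam * sum(abs(g) for g in gamma)
        row = [0.0] * p
        row[j] = 1.0 / t2
        for k, g in zip(others, gamma):
            row[k] = -g / t2
        theta.append(row)
        tau_sq.append(t2)
    return theta, tau_sq


random.seed(0)
n, p, lam = 60, 6, 0.1  # illustrative sizes and tuning parameter
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
theta_hat, tau_sq = nodewise_lasso(X, lam)
sigma_hat = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(p)]
             for a in range(p)]
# element-wise sup-norm of Theta_hat * Sigma_hat - I
max_dev = max(
    abs(sum(theta_hat[j][k] * sigma_hat[k][m] for k in range(p)) - (1.0 if j == m else 0.0))
    for j in range(p) for m in range(p)
)
kkt_bound = max(lam / t2 for t2 in tau_sq)  # sup-norm of Lambda * T_hat^{-2}
```

Up to the numerical tolerance of the solver, `max_dev` should not exceed `kkt_bound`, and the bound is attained whenever some nodewise coefficient is non-zero.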

5.3. Random design: bounds for compatibility constant and $\|\hat T^{-2}\|_\infty$. The
compatibility condition with constant $\phi_0^2$ being bounded away from zero is
ensured by a rather natural condition about sparsity. We have the following
result.

Lemma 5.2. Assume that $P_X$ is Gaussian satisfying (A2). Furthermore,
assume that $s_0 = o(n/\log(p))$ $(n \to \infty)$. Then, with probability tending to
one, the compatibility condition holds with compatibility constant
\[ \phi_0^2 \ge L/2 > 0. \]
A proof follows directly as in [26, Th.1].

Lemmas 5.1 and 5.2 say that we have a bound
\[ \|\hat\beta - \beta^0\|_1 = O_P\big(s_0\sqrt{\log(p)/n}\big), \qquad \|X(\hat\beta - \beta^0)\|_n^2 := \|X(\hat\beta - \beta^0)\|_2^2/n = O_P\big(s_0\log(p)/n\big), \tag{32} \]
when assuming (A2) for a Gaussian distribution $P_X$ and sparsity $s_0 =
o(n/\log(p))$.

When using the Lasso for nodewise regression in (8), we would like to have
a bound for $\|\hat T_{\rm Lasso}^{-2}\|_\infty$ appearing in Theorem 2.1.

Lemma 5.3. Assume (A2) with row-sparsity for $\Theta = \Sigma^{-1}$ bounded by
\[ s_j \le s_{\max} = o(n/\log(p)) \quad \text{for all } j = 1, \ldots, p. \]
Then, when choosing the regularization parameters $\lambda_j = \lambda_{\max} \asymp \sqrt{\log(p)/n}$,
\[ \|\hat T_{\rm Lasso}^{-2}\|_\infty = O_P(1) \quad (n \to \infty). \]

A proof follows using standard arguments. The compatibility assumption
holds for each nodewise regression with corresponding compatibility constant
bounded from below by $L/2$, as in Lemma 5.2. Furthermore, the population
error variance $\tau_j^2 = E[(X_j - \sum_{k \ne j}\gamma_{j,k}X_k)^2]$, where $\gamma_{j,k}$ are the population
regression coefficients of $X_j$ versus $\{X_k;\ k \ne j\}$, satisfies: for all $j$,
$\tau_j^2 = 1/\Theta_{j,j} \ge \Lambda_{\min}^2 \ge L > 0$ (see formula (35)) and $\tau_j^2 \le E[\|X_j\|_2^2/n] =
\Sigma_{j,j} \le \|\Sigma\|_\infty = O(1)$, thereby invoking assumption (A2). Thus, all the error
variances behave nicely and therefore, each nodewise regression satisfies
$\|X_{-j}(\hat\gamma_j - \gamma_j)\|_2^2/n = O_P(s_j\log(p)/n)$ (see Lemma 5.1 or (32)) and hence the
statement follows. □

5.4. Bounds for $\|\hat\beta - \beta^0\|_2$ with random design. As argued in Lemma 5.2,
assuming $s_0 = o(n/\log(p))$ $(n \to \infty)$, the compatibility condition holds
with probability tending to one. Therefore, the weaker restricted eigenvalue
condition [2] holds as well and assuming (A2) we have the bound (see [2]):
\[ \|\hat\beta - \beta^0\|_2 = O_P\big(\sqrt{s_0\log(p)/n}\big). \tag{33} \]

5.5. Proof of Theorem 2.2. Invoking Theorem 2.1 and Lemma 5.3 we have
that
\[ n^{1/2}\|\Delta\|_\infty \le O_P\big(s_0\log(p)\,n^{-1/2}\big) = o_P(1), \]
where the last bound follows by the sparsity assumption on $s_0$.

What remains to be shown is that $\|\hat\Omega - \Theta\|_\infty = o_P(1)$, as detailed by the
following Lemma.

Lemma 5.4. Assume
\[ \max_j\|\hat\Theta_j - \Theta_j\|_1 = O_P(\lambda_{\max}s_{\max}), \qquad \max_j\|\hat\Theta_j - \Theta_j\|_2 = O_P(\lambda_{\max}\sqrt{s_{\max}}). \]
Suppose that $\lambda_{\max}s_{\max} = o(1)$. Then
\[ \|\hat\Omega - \Theta\|_\infty = o_P(1). \]
Proof: We first show that for $L > 0$ from (A2):

(34) max \\<djh < L~ 2 < oo. 
j'=i»-iP 

Clearly, we have: 

( 35 ) e.^max— ^ = A"2 n . 



Furthermore, 



| O j 1 1 1 < -4« — - = -r#- < All < L~ 2 < oo, 



Al ,„ " Ai 



where we used (35) and assumption (A2) in the two last inequalities. There- 
fore, (34) holds. 

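Using standard arguments, analogous to (32) and using Lemma 5.3, we have
that
\[ \max_j\|\hat\Theta_j - \Theta_j\|_1 = O_P(\lambda_{\max}s_{\max}). \]
Hence, uniformly in $j$:
\[ \|\hat\Theta_j\|_1 = \|\Theta_j\|_1 + O_P(\lambda_{\max}s_{\max}) = O_P(\sqrt{s_{\max}}). \tag{36} \]
Furthermore, we have
\[ \hat\Omega = \hat\Theta\hat\Sigma\hat\Theta^T = (\hat\Theta\hat\Sigma - I)\hat\Theta^T + \hat\Theta^T \tag{37} \]
and
\[ \|(\hat\Theta\hat\Sigma - I)\hat\Theta^T\|_\infty \le \lambda_{\max}\|\hat T^{-2}\|_\infty\max_j\|\hat\Theta_j\|_1 = O_P(\lambda_{\max}\sqrt{s_{\max}}) = o_P(1), \tag{38} \]
where the second-last bound follows from Lemma 5.3 and (36). Finally, we
have using standard arguments for the $\ell_2$-norm bounds, see also (33):
\[ \|\hat\Theta - \Theta\|_\infty \le \max_j\|\hat\Theta_j - \Theta_j\|_2 = O_P(\lambda_{\max}\sqrt{s_{\max}}) = o_P(1). \tag{39} \]
Using (37)-(39) we complete the proof. □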



The proof of Theorem 2.2 is now complete using the fact that the sparsity
assumptions and (A2) automatically imply that the compatibility condition
holds for every nodewise regression (see also Lemma 5.2), and also the
restricted eigenvalue condition holds [2], which allows for bounding $\max_j\|\hat\Theta_j -
\Theta_j\|_2$ as worked out in [26, Th.1]. From this we deduce that the conditions
in Lemma 5.4 about $\max_j\|\hat\Theta_j - \Theta_j\|_q$ $(q = 1, 2)$ hold. □
5.6. Proof of Theorem 2.4. Under the sub-Gaussian assumption we know
that $\eta_j$ is also sub-Gaussian. So then $\|\eta_j^T X_{-j}/n\|_\infty = O_P(\sqrt{\log(p)/n})$. If
$\|X\|_\infty = O(K)$, we can use the work in [10] to conclude that
\[ \|\eta_j^T X_{-j}/n\|_\infty = O_P\big(K\sqrt{\log(p)/n}\big). \]
However, this result does not hold uniformly in $j$. Otherwise, in the strongly
bounded case, we have
\[ \|\eta_j\|_\infty \le \|X_j\|_\infty + \|X_{-j}\gamma_j^0\|_\infty = O(K). \]
So then $\|\eta_j^T X_{-j}/n\|_\infty = O_P(K\sqrt{\log(p)/n}) + O_P(K^2\log(p)/n)$, which is
uniform in $j$.

Then by standard arguments (see e.g. [2], and see [5] which complements
the concentration results in [18] for the case of errors with only second
moments) for $\lambda_j \asymp K_0\sqrt{\log(p)/n}$ (recall that $K_0 = 1$ in the sub-Gaussian
case and $K_0 = K$ in the (strongly) bounded case)
\[ \|X_{-j}(\hat\gamma_j - \gamma_j^0)\|_n^2 = O_P(s_j\lambda_j^2), \qquad \|\hat\gamma_j - \gamma_j^0\|_1 = O_P(s_j\lambda_j). \]
The condition $K^2 s_j\sqrt{\log(p)/n} = o(1)$ is used in the (strongly) bounded case to be
able to conclude that the empirical compatibility condition holds (see [5],
Section 6.12). In the sub-Gaussian case, we use that $\sqrt{s_j\log(p)/n} = o(1)$
and an extension of Theorem 1 in [26] from the Gaussian case to the
sub-Gaussian case. This gives again that the empirical compatibility condition
holds.

We further find that
\[ \|\hat\gamma_j - \gamma_j^0\|_2 = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]
To show this, we first introduce the notation $\beta^T\Sigma\beta := \|X\beta\|^2$. Then in the
(strongly) bounded case
\[ \big|\|X\beta\|_n^2 - \|X\beta\|^2\big| \le \|\hat\Sigma - \Sigma\|_\infty\|\beta\|_1^2 = O_P\big(K^2\sqrt{\log(p)/n}\big)\|\beta\|_1^2. \]
Since $\|\hat\gamma_j - \gamma_j^0\|_1 = O_P(K_0 s_j\sqrt{\log(p)/n})$ and the smallest eigenvalue $\Lambda_{\min}^2$ of
$\Sigma$ stays away from zero, this gives
\[ O_P\big(K_0^2 s_j\log(p)/n\big) = \|X_{-j}(\hat\gamma_j - \gamma_j^0)\|_n^2 \]
\[ \ge \Lambda_{\min}^2\|\hat\gamma_j - \gamma_j^0\|_2^2 - O_P\big(K_0^4 s_j^2(\log(p)/n)^{3/2}\big) \]
\[ \ge \Lambda_{\min}^2\|\hat\gamma_j - \gamma_j^0\|_2^2 - o_P\big(K_0^2 s_j\log(p)/n\big), \]
where we again used that $K_0^2 s_j\sqrt{\log(p)/n} = o(1)$. In the sub-Gaussian case,
the result for the $\|\cdot\|_2$-estimation error follows by similar arguments invoking
again a sub-Gaussian extension of Theorem 1 in [26].

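We moreover have
\[ |\hat\tau_j^2 - \tau_j^2| \le \underbrace{|\eta_j^T\eta_j/n - \tau_j^2|}_{I} + \underbrace{|\eta_j^T X_{-j}(\hat\gamma_j - \gamma_j^0)/n|}_{II} + \underbrace{|\eta_j^T X_{-j}\gamma_j^0/n|}_{III} + \underbrace{|(\gamma_j^0)^T X_{-j}^T X_{-j}(\hat\gamma_j - \gamma_j^0)/n|}_{IV}. \]
Now, since we assume fourth moments of the errors,
\[ I = O_P(K_0^2 n^{-1/2}). \]
Moreover,
\[ II = O_P\big(K_0\sqrt{\log(p)/n}\big)\|\hat\gamma_j - \gamma_j^0\|_1 = O_P\big(K_0^2 s_j\log(p)/n\big). \]
As for $III$, we have
\[ III = O_P\big(K_0\sqrt{\log(p)/n}\big)\|\gamma_j^0\|_1 = O_P\big(K_0\sqrt{s_j\log(p)/n}\big) \]
since $\|\gamma_j^0\|_1 \le \sqrt{s_j}\|\gamma_j^0\|_2 = O(\sqrt{s_j})$. Finally, by the KKT conditions
\[ \|X_{-j}^T X_{-j}(\hat\gamma_j - \gamma_j^0)/n\|_\infty = O_P\big(K_0\sqrt{\log(p)/n}\big), \]
and hence
\[ IV = O_P\big(K_0\sqrt{\log(p)/n}\big)\|\gamma_j^0\|_1 = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]
So now we have shown that
\[ |\hat\tau_j^2 - \tau_j^2| = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]
Since $1/\tau_j^2 = O(1)$, this implies that also
\[ 1/\hat\tau_j^2 - 1/\tau_j^2 = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]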

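We conclude that
\[ \|\hat\Theta_j - \Theta_j^0\|_1 = \|\hat C_j/\hat\tau_j^2 - C_j^0/\tau_j^2\|_1 \le \underbrace{\|\hat\gamma_j - \gamma_j^0\|_1/\hat\tau_j^2}_{i} + \underbrace{\|\gamma_j^0\|_1(1/\hat\tau_j^2 - 1/\tau_j^2)}_{ii}, \]
where
\[ i = O_P\big(K_0 s_j\sqrt{\log(p)/n}\big) \]
since $\hat\tau_j^2$ is a consistent estimator of $\tau_j^2$ and $1/\tau_j^2 = O(1)$, and also
\[ ii = O_P\big(K_0 s_j\sqrt{\log(p)/n}\big), \]
since $\|\gamma_j^0\|_1 = O(\sqrt{s_j})$.
Recall that
\[ \|\hat\gamma_j - \gamma_j^0\|_2 = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]
But then
\[ \|\hat\Theta_j - \Theta_j^0\|_2 \le \|\hat\gamma_j - \gamma_j^0\|_2/\hat\tau_j^2 + \|\gamma_j^0\|_2(1/\hat\tau_j^2 - 1/\tau_j^2) = O_P\big(K_0\sqrt{s_j\log(p)/n}\big). \]
For the last part, we write
\[ \hat\Theta_j^T\Sigma\hat\Theta_j - \Theta_{j,j}^0 = (\hat\Theta_j - \Theta_j^0)^T\Sigma(\hat\Theta_j - \Theta_j^0) + 2(\Theta_j^0)^T\Sigma(\hat\Theta_j - \Theta_j^0) + (\Theta_j^0)^T\Sigma\Theta_j^0 - \Theta_{j,j}^0 \]
\[ = (\hat\Theta_j - \Theta_j^0)^T\Sigma(\hat\Theta_j - \Theta_j^0) + 2(1/\hat\tau_j^2 - 1/\tau_j^2), \]
since $(\Theta_j^0)^T\Sigma = e_j^T$, $(\Theta_j^0)^T\Sigma\Theta_j^0 = \Theta_{j,j}^0$, $\hat\Theta_{j,j} = 1/\hat\tau_j^2$, and $\Theta_{j,j}^0 = 1/\tau_j^2$. But
\[ (\hat\Theta_j - \Theta_j^0)^T\Sigma(\hat\Theta_j - \Theta_j^0) \le \|\Sigma\|_\infty\|\hat\Theta_j - \Theta_j^0\|_1^2. \]
We may also use
\[ (\hat\Theta_j - \Theta_j^0)^T\Sigma(\hat\Theta_j - \Theta_j^0) \le \Lambda_{\max}^2\|\hat\Theta_j - \Theta_j^0\|_2^2. \]
□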



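5.7. Proof of Theorem 3.1. Note that
\[ \dot\rho(y, x_i\hat\beta) = \dot\rho(y, x_i\beta^0) + \ddot\rho(y, \tilde a_i)\,x_i(\hat\beta - \beta^0), \]
where $\tilde a_i$ is a point intermediating $x_i\hat\beta$ and $x_i\beta^0$, so that $|\tilde a_i - x_i\hat\beta| \le |x_i(\hat\beta - \beta^0)|$.

We find by the Lipschitz condition on $\ddot\rho$ (Condition (C1))
\[ |\ddot\rho(y, \tilde a_i)\,x_i(\hat\beta - \beta^0) - \ddot\rho(y, x_i\hat\beta)\,x_i(\hat\beta - \beta^0)| \le |\tilde a_i - x_i\hat\beta|\,|x_i(\hat\beta - \beta^0)| \le |x_i(\hat\beta - \beta^0)|^2. \]
Thus, using that by Condition (C5) $|x_i\hat\Theta_j^T| = O_P(K)$ uniformly in $i$,
\[ \hat\Theta_j^T P_n\dot\rho_{\hat\beta} = \hat\Theta_j^T P_n\dot\rho_{\beta^0} + \hat\Theta_j^T P_n\ddot\rho_{\hat\beta}(\hat\beta - \beta^0) + {\rm Rem}_1, \]
where
\[ {\rm Rem}_1 = O_P(K)\sum_{i=1}^n |x_i(\hat\beta - \beta^0)|^2/n = O_P(K)\|X(\hat\beta - \beta^0)\|_n^2 = O_P(K s_0\lambda^2) = o_P(n^{-1/2}), \]
where we used Condition (C2) and in the last step Condition (C8).
We know that by Condition (C4)
\[ \|\hat\Theta_j^T P_n\ddot\rho_{\hat\beta} - e_j^T\|_\infty = O_P(\lambda_*). \]
It follows that
\[ \hat b_j - \beta_j^0 = \hat\beta_j - \beta_j^0 - \hat\Theta_j^T P_n\dot\rho_{\hat\beta} \]
\[ = \hat\beta_j - \beta_j^0 - \hat\Theta_j^T P_n\dot\rho_{\beta^0} - \hat\Theta_j^T P_n\ddot\rho_{\hat\beta}(\hat\beta - \beta^0) - {\rm Rem}_1 \]
\[ = -\hat\Theta_j^T P_n\dot\rho_{\beta^0} - (\hat\Theta_j^T P_n\ddot\rho_{\hat\beta} - e_j^T)(\hat\beta - \beta^0) - {\rm Rem}_1 = -\hat\Theta_j^T P_n\dot\rho_{\beta^0} - {\rm Rem}_2, \]
where
\[ |{\rm Rem}_2| \le |{\rm Rem}_1| + O(\lambda_*)\|\hat\beta - \beta^0\|_1 = o_P(n^{-1/2}) + O_P(s_0\lambda\lambda_*) = o_P(n^{-1/2}), \]
since by Condition (C2) $\|\hat\beta - \beta^0\|_1 = O_P(\lambda s_0)$ and by the second part of
Condition (C8) also $\lambda_*\lambda s_0 = o(n^{-1/2})$.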

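We now have to show that our estimator of the variance is consistent. We
find
\[ |(\hat\Theta P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j} - (\hat\Theta P_n\dot\rho_{\hat\beta}\dot\rho_{\hat\beta}^T\hat\Theta^T)_{j,j}| \]
\[ \le \underbrace{|(\hat\Theta(P_n - P)\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j}|}_{I} + \underbrace{|(\hat\Theta P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j} - (\hat\Theta P\dot\rho_{\hat\beta}\dot\rho_{\hat\beta}^T\hat\Theta^T)_{j,j}|}_{II}. \]
But, writing $\varepsilon_{k,l} := (P_n - P)\dot\rho_{k,\beta^0}\dot\rho_{l,\beta^0}$, we see that
\[ I = |(\hat\Theta(P_n - P)\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j}| = \Big|\sum_{k,l}\hat\Theta_{j,k}\hat\Theta_{j,l}\varepsilon_{k,l}\Big| \le \|\hat\Theta_j\|_1^2\|\varepsilon\|_\infty = O_P(s_* K^2\lambda), \]
where we used Conditions (C5) and (C6).
Next, we will handle $II$. We have
\[ \dot\rho_{\hat\beta}(y,x)\dot\rho_{\hat\beta}^T(y,x) - \dot\rho_{\beta^0}(y,x)\dot\rho_{\beta^0}^T(y,x) = [\dot\rho^2(y, x\hat\beta) - \dot\rho^2(y, x\beta^0)]\,x^T x := w(y,x)\,x^T x, \]
with
\[ |w(y,x)| := |\dot\rho^2(y, x\hat\beta) - \dot\rho^2(y, x\beta^0)| = O_P(1)|x(\hat\beta - \beta^0)|, \]
where we use that $\ddot\rho$ is locally bounded (Condition (C1)). It follows from
Condition (C2) that
\[ P|w| \le \sqrt{P|w|^2} = O_P(\lambda\sqrt{s_0}). \]
Moreover by Condition (C5)
\[ \|\hat\Theta_j^T x^T\|_\infty = O_P(K) \]
so that
\[ |(\hat\Theta w(x,y)\,x^T x\,\hat\Theta^T)_{j,j}| \le O(K^2)|w(y,x)|. \]
Thus
\[ II \le O_P(K^2)\,P|w| = O_P(K^2\sqrt{s_0}\,\lambda). \]
It follows that
\[ I + II = O_P(K^2 s_*\lambda) + O_P(K^2\sqrt{s_0}\,\lambda) = o_P(1) \]
by the last part of Condition (C8). □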
6. Supplemental Section.

6.1. Proofs for results in Section 2.3.3. 

6.1.1. Proof of Theorem 2.3. Arguing as in the beginning of Section 2.3.3,
we consider the submodel parameterized by $\beta_1$ only and in which the mean
of $Y$ is shifted by $\beta_1(X_1 - E[X_1|Z])$. In this submodel, the efficient estimator
is asymptotically normal with variance $\sigma_\varepsilon^2/{\rm Var}(X_1 - E[X_1|Z])$. Now,
Corollary 2.1 ensures that $\hat b_{{\rm Lasso},1}$ is asymptotically normal with variance $\sigma_\varepsilon^2\hat\Omega_{n,1,1}$.
However, by (14) and Corollary 2.1, $\hat\Omega_{n,1,1} \to 1/{\rm Var}(X_1 - E[X_1|Z])$. Moreover,
by Corollary 2.1 again, the convergence is uniform and hence regularity for
one-dimensional submodels follows.

6.1.2. Proof that model (15) satisfies (14) and conditions of Theorem 2.2.
We use $Z = (X_j)_{j=2}^\infty$ and $K(Z) = \sum_{j=2}^\infty\beta_j^0 X_j$. The parameter $\beta^n$ equals
\[ \beta^n = a(p_n, \beta^0, P_X) = \arg\min_\alpha E\Big[\Big(Y - \sum_{j=1}^{p_n}\alpha_j X_j\Big)^2\Big]. \]
Since $\beta^0 \in \mathcal{B}$ has finite support $S(\beta^0)$, there exists $n(\beta^0)$ such that
\[ \beta_j^n = a(p_n, \beta^0, P_X)_j = \beta_j^0 \quad \forall j = 1, \ldots, p_n \ \ \forall n \ge n(\beta^0). \]
In fact, we can choose $n(\beta^0) = \min\{n;\ \{1, \ldots, p_n\} \supseteq S(\beta^0)\}$. This implies
the first condition in (14), namely that $K(Z) - \sum_{j=2}^{p_n}\beta_j^n X_j = 0$ for all
$n \ge n(\beta^0)$.

Using $\gamma^n = \arg\min_\kappa E[(X_1 - \sum_{j=2}^{p_n}\kappa_j X_j)^2]$ and the fact that $E[X_1|Z] =
\sum_{j=2}^\infty\gamma_j^0 X_j$ where $\gamma^0 = \gamma(P_X)$ has finite sparsity, we can use exactly the
same argument as above to conclude that the second condition in (14) holds,
namely $E[X_1|Z] - \sum_{j=2}^{p_n}\gamma_j^n X_j = 0$ for all $n$ greater than some $n(\gamma^0)$. The
last condition in (14) follows then as well.

Finally, since $\beta^0$ and also $\gamma^0$ have finite sparsity, the projected parameters
$\beta^n$ and $\gamma^n$ have finite sparsity as well (because the projected values are equal to
the non-projected values for $n$ sufficiently large). Hence, the conditions of
Theorem 2.2 hold. □

6.2. Proofs for results in Section 3.3. 

6.2.1. Proof of Theorem 3.2. We can write 



Hence 



X 0j = X 



By definition 




This implies 



W X $-j(%j ~ 7/3°,j)\\ 2 n + Aill^Jl 



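But by the Cauchy-Schwarz inequality
\[ (W_{\hat\beta}W_{\beta^0}^{-1}\eta_{\beta^0,j},\ X_{\hat\beta,-j}(\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}))_n - (\eta_{\beta^0,j},\ X_{\beta^0,-j}(\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}))_n \]
\[ \le \|(W_{\hat\beta}^2 W_{\beta^0}^{-2} - I)\eta_{\beta^0,j}\|_n\,\|X_{\beta^0,-j}(\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j})\|_n. \]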



n 



Since 




we get 




O(K 2 )\\X0-f3°)\\l = O F (K 2 X 2 s ), 



33 



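where in the last step, we used the third (Lipschitz) part of the conditions on
the weights, as well as the conditions on the rate of convergence of
$\|X(\hat\beta - \beta^0)\|_n^2$. Now, for arbitrary $\delta > 0$ we have
\[ 2ab \le \delta a^2 + b^2/\delta. \]
Hence, we get for arbitrary $0 < \delta < 1$
\[ (1 - \delta)\|X_{\hat\beta,-j}(\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j})\|_n^2 + \lambda_j\|\hat\gamma_{\hat\beta,j}\|_1 \le 2(\eta_{\beta^0,j},\ X_{\beta^0,-j}(\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}))_n + \lambda_j\|\gamma_{\beta^0,j}\|_1 + O_P(K^2\lambda^2 s_0). \]
Here, we invoked that $\|X_{\beta^0}(\hat\beta - \beta^0)\|_n = O_P\|X_{\hat\beta}(\hat\beta - \beta^0)\|_n$ since the weights
stay away from zero and infinity.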
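
The elementary inequality $2ab \le \delta a^2 + b^2/\delta$ used in this step follows from expanding a square. The short derivation is included only as a reminder and holds for all real $a$, $b$ and every $\delta > 0$:

```latex
\[
0 \le \Big(\sqrt{\delta}\,a - \frac{b}{\sqrt{\delta}}\Big)^2
  = \delta a^2 - 2ab + \frac{b^2}{\delta}
\quad\Longrightarrow\quad
2ab \le \delta a^2 + \frac{b^2}{\delta}, \qquad \delta > 0 .
\]
```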

This implies by the same arguments as in Theorem 2.4
\[ \|\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}\|_1 = O_P\big(K^2 s_j\sqrt{\log(p)/n}\big) + O_P\big(K^2\lambda^2 s_0/\lambda_j\big) \]
and
\[ \|\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}\|_2 = O_P\big(K\sqrt{s_j\log(p)/n}\big) + O_P\big(K\lambda\sqrt{s_0}\big). \]
Indeed, it is easy to see that the compatibility condition holds with $\Sigma_{\beta^0}$
since it is non-singular with smallest eigenvalue staying away from zero.
Since $K^2 s_j\sqrt{\log(p)/n} = o(1)$, the compatibility condition also holds for $\hat\Sigma_{\beta^0}$ with a
slightly smaller compatibility constant. But then it also holds for $\hat\Sigma_{\hat\beta}$ because
the weights stay away from zero and infinity. This argument can then be used
as well to obtain the rate in $\ell_2$, as in Theorem 2.4. Next, by definition

Insert 
and 

(X 0J - = WfiWfihpj + Xpjtffc - 7? 0J )). 

We then get 

\[ \hat\tau^2_{\hat\beta,j} - \tau^2_{\beta^0,j} = (i) + (ii), \]

where 

(ii) := Xj - (wjWp 2 - /) (Xpo d - Xpo^fyJ/n. 
We can treat (i) in the same way as in Theorem 2.4 to find



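As for (ii), since $\|X_{\beta^0,j}\|_\infty = O(K)$ and
\[ \|X_{\hat\beta,-j}\hat\gamma_{\hat\beta,j}\|_\infty \le \|X_{\beta^0,-j}\gamma_{\beta^0,j}\|_\infty + O(K)\|\hat\gamma_{\hat\beta,j} - \gamma_{\beta^0,j}\|_1 = O_P(K) \]
we get
\[ (ii) = O_P(K^2)\sum_{i=1}^n |w_{i,\hat\beta}^2 - w_{i,\beta^0}^2|/n. \]
So we arrive at
\[ |\hat\tau^2_{\hat\beta,j} - \tau^2_{\beta^0,j}| = O_P\big(K\sqrt{s_j\log(p)/n}\big) + O_P\big(K^2\sqrt{s_0}\,\lambda\big). \]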

The rest of the proof goes along the lines of the proof of Theorem 2.4. □ 

The last result of Theorem 3.2 is actually a direct corollary of the following 
simple lemma. 

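Lemma 6.1. Let $A$ be a symmetric $(p \times p)$-matrix with largest eigenvalue
$\Lambda_A^2$ and $\hat v$ and $v \in \mathbb{R}^p$. Then
\[ |\hat v^T A\hat v - v^T A v| \le \big(\|A\|_\infty\|\hat v - v\|_1^2\big) \wedge \big(\Lambda_A^2\|\hat v - v\|_2^2\big) + 2\big(\|Av\|_\infty\|\hat v - v\|_1\big) \wedge \big(\|Av\|_2\|\hat v - v\|_2\big). \]

Proof. It is clear that
\[ \hat v^T A\hat v - v^T A v = (\hat v - v)^T A(\hat v - v) + 2v^T A(\hat v - v). \]
The result follows therefore from
\[ |(\hat v - v)^T A(\hat v - v)| \le \big(\|A\|_\infty\|\hat v - v\|_1^2\big) \wedge \big(\Lambda_A^2\|\hat v - v\|_2^2\big), \]
\[ |v^T A(\hat v - v)| \le \|Av\|_\infty\|\hat v - v\|_1 \]
and
\[ |v^T A(\hat v - v)| \le \|Av\|_2\|\hat v - v\|_2. \]
□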
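
The decomposition used in the proof of Lemma 6.1 and the resulting bound are easy to verify numerically. The sketch below does so for a randomly generated symmetric positive semi-definite matrix; the data are simulated and all names are illustrative. To avoid an eigenvalue computation, it checks the bound with the quadratic term controlled only through the element-wise sup-norm (one of the two options in the lemma), while the cross term uses the smaller of its two bounds.

```python
import random


def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def quad_form(A, v):
    """v^T A v."""
    return sum(x * y for x, y in zip(v, matvec(A, v)))


random.seed(1)
p = 5
B = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(p)]
# A = B^T B is symmetric and positive semi-definite.
A = [[sum(B[k][a] * B[k][b] for k in range(p)) for b in range(p)] for a in range(p)]
v = [random.gauss(0.0, 1.0) for _ in range(p)]
v_hat = [x + 0.1 * random.gauss(0.0, 1.0) for x in v]
d = [a - b for a, b in zip(v_hat, v)]  # v_hat - v

lhs = abs(quad_form(A, v_hat) - quad_form(A, v))
# identity from the proof: difference = d^T A d + 2 v^T A d
Av = matvec(A, v)
cross = sum(x * y for x, y in zip(Av, d))
identity_gap = abs((quad_form(A, v_hat) - quad_form(A, v)) - (quad_form(A, d) + 2.0 * cross))

norm1 = sum(abs(x) for x in d)
norm2 = sum(x * x for x in d) ** 0.5
quad_bound = max(abs(x) for row in A for x in row) * norm1 ** 2
cross_bound = min(max(abs(x) for x in Av) * norm1,
                  sum(x * x for x in Av) ** 0.5 * norm2)
rhs = quad_bound + 2.0 * cross_bound
```

Here `lhs` should not exceed `rhs`, and `identity_gap` should vanish up to rounding error.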



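6.2.2. Proof of Lemma 3.1. It holds that
\[ \Big|\hat\Theta_{\hat\beta,j}^T\sum_{i=1}^n x_i^T\xi_i/n - \Theta_{\beta^0,j}^T\sum_{i=1}^n x_i^T\xi_i/n\Big| = O_P\big(\sqrt{\log(p)/n}\big)\|\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}\|_1 = o_P(n^{-1/2}) \]
by Corollary 3.1. For the second result, we use Lemma 6.1. We get
\[ |\hat\Theta_{\hat\beta,j}^T A\hat\Theta_{\hat\beta,j} - \Theta_{\beta^0,j}^T A\Theta_{\beta^0,j}| \le \|A\|_\infty\,o_P(1/\log(p)) + \|A\Theta_{\beta^0,j}\|_\infty\,o_P\big(1/\sqrt{\log(p)}\big) = o_P(1). \]
We thus have that
\[ \frac{\hat\Theta_{\hat\beta,j}^T\sum_{i=1}^n x_i^T\xi_i/\sqrt{n}}{\sqrt{\hat\Theta_{\hat\beta,j}^T A\hat\Theta_{\hat\beta,j}}} = \frac{\Theta_{\beta^0,j}^T\sum_{i=1}^n x_i^T\xi_i/\sqrt{n}}{\sqrt{\Theta_{\beta^0,j}^T A\Theta_{\beta^0,j}}} + o_P(1). \]
The random variables $\Theta_{\beta^0,j}^T x_i^T\xi_i$ are bounded, since $|\xi_i| \le 1$ and $|x_i\Theta_{\beta^0,j}| =
O(1)$, and the $x_i^T\xi_i$'s are i.i.d.: thus, the Lindeberg condition is fulfilled and
the asymptotic normality follows from this. □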